Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products. The AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50 accelerators are the first GPUs in the world that are based on the advanced 7nm FinFET process technology. The ability to go down to 7nm allows us to put more transistors onto an even smaller package than was possible before – in this case, the MI60 contains 13.2 billion transistors on a package size of 331.46 mm², while the previous generation Radeon Instinct™ MI25 had 12.5 billion transistors on a package size of 494.8 mm² – a 58% improvement in number of transistors per mm². This allows us to provide a more powerful and robust product, capable of tackling a wide range of workloads from training and inference to high performance computing.
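For readers who want to check the density claim, here is a minimal sketch that reproduces the 58% figure from the transistor counts and package sizes quoted above. The variable and function names are ours, chosen for illustration only.

```python
# Figures quoted in the paragraph above.
mi60_transistors = 13.2e9
mi60_area_mm2 = 331.46
mi25_transistors = 12.5e9
mi25_area_mm2 = 494.8


def density(transistors, area_mm2):
    """Return transistors per square millimetre."""
    return transistors / area_mm2


mi60_density = density(mi60_transistors, mi60_area_mm2)
mi25_density = density(mi25_transistors, mi25_area_mm2)

# Relative gain in transistors per mm^2 from MI25 to MI60.
improvement_pct = (mi60_density / mi25_density - 1) * 100

print(f"MI60: {mi60_density / 1e6:.1f} M transistors/mm^2")
print(f"MI25: {mi25_density / 1e6:.1f} M transistors/mm^2")
print(f"Improvement: {improvement_pct:.0f}%")
```

The result rounds to the 58% stated in the text.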

We’ve made numerous improvements on these new products, including optimized deep learning operations. In addition to native half-precision (FP16) performance, the MI60 and MI50 now support INT8 and INT4 operations, delivering up to a whopping 118 TOPS of INT4 peak performance on the MI60. The supercharged compute capabilities of these new products are designed to meet today’s demanding system requirements of handling large data efficiently for training complex neural networks and running inference against those neural networks used in deep learning.

World’s Fastest Double Precision PCIe® Based Accelerator

On the other end of the compute spectrum are FP64 calculations primarily used in high performance compute workloads. These types of workloads require extreme accuracy and speed, which the MI60 and MI50 deliver. The Radeon Instinct MI60 is the fastest double precision PCIe® based accelerator,¹ delivering up to 7.4 TFLOPS of FP64 peak performance, while the MI50 is not far behind at 6.7 TFLOPS. In addition to fast FP64 performance, the MI60 and MI50 both sport full-chip ECC memory³ as well as RAS.⁴ This allows scientists and researchers across several industries including life sciences, energy, automotive and aerospace, government and more to achieve results with both speed and accuracy.

Finely Balanced, Ultra-Scalable Datacenter Solution

Most of the improvements we’ve talked about so far have been at the chip level, but we didn’t stop there. We also have a number of new benefits beyond the chip. We meticulously designed the MI60 and MI50 to deliver finely tuned and balanced performance. We took a look at some of the common bottlenecks found in previous generations and made improvements to ensure your data is processed in the most efficient manner possible. This includes making these cards PCIe® Gen 4* capable, delivering up to 2x the bandwidth (64 GB/s vs. 32 GB/s) of PCIe® Gen 3 when communicating over the bus. In addition to improved performance between GPU and CPU, we’ve also built into these products a peer-to-peer GPU communication feature called Infinity Fabric™ Link technology. Each card includes two physical Infinity Fabric™ Links, allowing you to directly connect four GPUs together in a GPU hive ring and up to two of these hives in an 8 GPU server. Each GPU card provides up to 200 GB/s bandwidth between peer GPUs, which is up to 6x faster than PCIe Gen 3 alone.² We have also doubled memory bandwidth speeds from our previous generation Radeon Instinct MI25 accelerator,⁵ delivering up to 1 TB/s memory bandwidth on both the MI50 and MI60 accelerators – the first GPUs to achieve this speed.
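As a rough illustration of the numbers in this section, the sketch below works out the bandwidth ratios from the quoted figures and models a four-GPU hive as a ring in which each GPU uses its two links to reach its two neighbors. The GPU numbering and the `ring_neighbors` helper are assumptions made for the example, not part of any AMD tool.

```python
# Peak bandwidth figures quoted in the text, in GB/s.
pcie_gen3_gbs = 32
pcie_gen4_gbs = 64
infinity_fabric_gbs = 200

pcie_speedup = pcie_gen4_gbs / pcie_gen3_gbs        # Gen 4 vs. Gen 3
link_speedup = infinity_fabric_gbs / pcie_gen3_gbs  # peer links vs. Gen 3


def ring_neighbors(gpu, hive_size=4):
    """Return the two peers a GPU reaches directly in a ring of hive_size GPUs.

    GPUs are numbered 0..hive_size-1 (a hypothetical numbering); each one
    uses one link for the previous GPU and one for the next.
    """
    return ((gpu - 1) % hive_size, (gpu + 1) % hive_size)


print(f"PCIe Gen 4 vs. Gen 3: {pcie_speedup:.1f}x")
print(f"Infinity Fabric Link vs. PCIe Gen 3: {link_speedup:.2f}x")
for gpu in range(4):
    print(f"GPU {gpu} links to GPUs {ring_neighbors(gpu)}")
```

The second ratio comes out at 6.25, in line with the "up to 6x" claim, and two links per card are exactly what a ring of four needs.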

With improved performance from both within the GPU and between GPUs and CPUs, these new finely-balanced, ultra-fast and scalable solutions are the ideal datacenter compute solution for all your needs whether they’re inference, training or HPC related.

Warren Eng is a Product Marketing Manager for professional graphics and compute at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-5

This year at SC17, AMD showcased Radeon Instinct™ accelerators, AMD EPYC™ processors and the ROCm open software platform – a complete ecosystem to drive a new era in the datacenter. Our booth was packed with server racks from partners like Inventec, Gigabyte, Supermicro and BOXX. Attendees had the opportunity to check out Project 47, both on display and running demos, offering 1 PetaFLOPS of compute power.

The much anticipated TensorFlow support with ROCm 1.7 was revealed in our booth alongside a demo of deep learning inference from a trained Caffe model. AMD also offered hourly Tech Talks, diving into a wide range of topics – from AMD EPYC™ performance to Radeon technology powering the exploration of dark energy with the CHIME radio telescope.

Thank you to everyone who joined us at SC17. For those who were unable to attend, check out our photo gallery below. We hope to see you next year at SC18!

Daniel Skrba is a Marketing and Communications Specialist for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Visit AMD at our SC17 booth #825 and learn how AMD, together with our partners, is bringing about a new era in the datacenter that is revolutionizing High Performance Computing with our new AMD EPYC™ processors and Radeon Instinct™ accelerators. On top of this year’s show-stopping demos, you will have the opportunity to attend one of our interactive and educational booth Tech Talks – check out the schedule below.

AMD is excited to see the emergence of the Open Neural Network Exchange (ONNX) format, which creates a common model format to bridge three industry-leading deep learning frameworks (PyTorch, Caffe2, and Cognitive Toolkit) and gives our customers simpler paths to explore their networks via rich framework interoperability.

The ONNX format, via its extensible computation graph model, built-in operators, and standard data types, will allow our team to focus on more in-depth optimization with our Radeon Instinct hardware and a more productive solution set via our open source MIOpen deep learning solver library and ROCm Compiler technology. It also gives us a path to explore new foundations for production, beyond traditional frameworks, that bring lighter-weight, more optimized solutions to our hardware.

It is great to see the collaboration of Facebook and Microsoft continuing to follow the path of open software development practice with ONNX, building on their open source projects PyTorch, Caffe2, and Cognitive Toolkit. Open software development aligns with our philosophy of bringing out an open source software platform, tools, and drivers that give the research community more power to explore a broader deep learning design space.

We feel this is an excellent step for the community to open up these platforms to a broader set of diverse architectures. We look forward to working with the project and helping it grow in the coming months.

Gregory Stoner is Sr. Director of Radeon Open Compute. Links to third-party sites and references to third-party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third-party endorsement of AMD or any of its products is implied. Use of third-party names or marks is for informational purposes only and no endorsement of or by AMD is intended or implied.

Deep Learning, an advanced form of machine learning, has generated a lot of interest due to the wide range of applications on complex data sets. Current technologies and the availability of very large amounts of complex data have made analytics on the latter more tractable.

With deep neural networks as the basis for deep learning algorithms, GPUs are now being used in deep learning applications because they provide many processing units. These processing units simulate a neural network that does the computation on data. Neural networks can therefore scale and improve the extraction of information from data.

ROCm and The AMD Deep Learning Stack

The AMD Deep Learning Stack is the result of AMD’s initiative to enable DL applications using their GPUs such as the Radeon Instinct product line. Currently, deep learning frameworks such as Caffe, Torch, and TensorFlow are being ported and tested to run on the AMD DL stack. Supporting these frameworks is MIOpen, AMD’s open-source deep learning library built for the Radeon Instinct line of compute accelerators.

AMD’s ROCm platform serves as the foundation of this DL stack. ROCm enables the seamless integration of the CPU and GPU for high performance computing (HPC) and ultra-scale class computing. To achieve this, ROCm is built for language independence and takes advantage of the Heterogeneous System Architecture (HSA) Runtime API.³ This is the basis of the ROCr System Runtime, a thin user-mode API providing access to graphics hardware driven by the AMDGPU driver and the ROCk kernel driver.

For now, OS support for ROCm is limited to Ubuntu 14.04, Ubuntu 16.04, and Fedora 23. For these OSs, AMD provides a modified Linux version 4.6 kernel with patches to the HSA kernel driver (amdkfd) and the AMDGPU (amdgpu) kernel driver currently in the mainline Linux kernel.⁵

Using Docker With The AMD Deep Learning Stack

Docker Containers

Software containers isolate an application and its dependencies from other software installed on the host. Each container abstracts the underlying operating system while keeping its own resources (filesystem, memory, CPU) and environment separate from other containers.

In contrast to virtual machines, all containers running on the same host share a single operating system without the need to virtualize a complete machine with its own OS. This makes software containers perform much faster than virtual machines because of the lack of overhead from the guest OS and the hypervisor.

Docker is the most popular software container platform today. It is available for Linux, macOS, and Microsoft Windows. Docker containers can run under any OS with the Docker platform installed.⁶

Installing Docker and The AMD Deep Learning Stack

The ROCm-enabled Linux kernel and the ROCk driver, together with other needed kernel modules, must be installed on all hosts that run Docker containers. This is because the containers do not have the kernel installed inside them. Instead, the containers share the host kernel.⁷
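One way to see the shared kernel in practice is to run a small check both on the host and inside a container and compare the results. The sketch below is only an illustration: the `rocm_preflight` name is ours, and it assumes the ROCm kernel driver exposes the /dev/kfd device node used in the docker run commands later in this article.

```python
import os
import platform


def rocm_preflight(kfd_path="/dev/kfd"):
    """Report the running kernel release and whether the KFD device node exists.

    A container reports the host's kernel release because it has no kernel
    of its own. The default path is the device passed to docker run with
    --device in the examples that follow.
    """
    return {
        "kernel_release": platform.release(),
        "kfd_present": os.path.exists(kfd_path),
    }


if __name__ == "__main__":
    info = rocm_preflight()
    print(f"Kernel release: {info['kernel_release']}")
    print(f"/dev/kfd present: {info['kfd_present']}")
```

If the kernel release differs between host and container, or the device node is missing inside the container, the container was most likely started without the device mapping.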

The installation procedure described here is for Ubuntu 16.04. Ubuntu 16.04 is currently the most tested OS for ROCm.

We are installing the Docker Community Edition (also called Docker CE) on the host by using Docker’s apt repository. Our procedure is based on documentation published by Docker.⁸ There may be some slight differences from the original documentation. Note that the installation is done as the superuser. You can also use sudo to install Docker.

First, remove old versions of Docker:

apt remove docker docker-engine

If they are not installed, you will simply get a message that they are missing.

With Docker’s apt repository configured as described in that documentation, install Docker CE with 'apt install docker-ce'. After the installation is complete, verify that Docker is properly installed and configured using the command 'docker run hello-world'.

Running ROCm Docker Images

AMD provides a Docker image of the ROCm software framework.⁹ The image can be pulled from the official Docker repository:

sudo docker pull rocm/rocm-terminal

The image is about 1.5 GB in size and contains the necessary libraries to run ROCm-based applications. Create a container out of this image and look at the installed software in /opt/rocm:

sudo docker run -it --rm --device=/dev/kfd rocm/rocm-terminal

You can check for the ROCm libraries using ldconfig:

ldconfig -NXv

The command above should list all the libraries in the library path including the ROCm libraries.

We can use the rocm/rocm-terminal Docker image to build our own ROCm application Docker image. In the following examples, we use a couple of the sample applications that come with the ROCm development package. One of them shall be /opt/rocm/hip/samples/1_Utils/hipInfo.

Assuming the host has the complete ROCm development tools, we just do the following:

cd /opt/rocm/hip/samples/1_Utils/hipInfo

make

The outcome of the make command shall be a binary called hipInfo.

If the compiler complains because of a missing shared library called libsupc++, we will need to install that somewhere in the host’s library path. In our case, we shall place the shared library in /usr/local/lib and make sure that ldconfig can find it. You can simply create a shared library from the installed static library /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a:

mkdir -p ~/tmp/libsupc++

cd ~/tmp/libsupc++

ar x /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a

ls -l *.o

gcc -shared -o libsupc++.so *.o

sudo cp -p libsupc++.so /usr/local/lib/

sudo ldconfig -v

Make sure that /usr/local/lib is seen by ldconfig. You may have to specify this directory in /etc/ld.so.conf.d if it is not found. Simply add a file named local_lib.conf with the line /usr/local/lib by itself.

Check the output of hipInfo by running it. The exact output depends on the type of GPU configuration you have.

Now that hipInfo is compiled and has been tested, let us create a Docker image with it. Create a directory for building an image with Docker.

mkdir ~/tmp/my_rocm_hipinfo

cd ~/tmp/my_rocm_hipinfo

Copy the necessary files for the Docker image to run properly:

cp -p /usr/local/lib/libsupc++.so . # If hipInfo needs this

cp -p /opt/rocm/hip/samples/1_Utils/hipInfo/hipInfo .

Create a file named Dockerfile in the current directory. It should contain this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

COPY hipInfo /usr/local/bin/

RUN sudo ldconfig

USER rocm-user

WORKDIR /home/rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/bin"

ENTRYPOINT ["hipInfo"]

Build the Docker image:

sudo docker build -t my_rocm_hipinfo .

Create and run a container based on the new image:

sudo docker run --rm --device="/dev/kfd" my_rocm_hipinfo

The device /dev/kfd is the interface exposed by the kernel fusion driver (amdkfd). You should get output similar to what you got when running the hipInfo binary directly on the host.

Without the --rm parameter, the container persists after it exits. You can run the image again, this time giving the container a name so that it can be restarted later:

sudo docker run --device="/dev/kfd" --name nifty_hugle my_rocm_hipinfo

The Docker container shall persist:

sudo docker ps -a

The listing should include the nifty_hugle container and the my_rocm_hipinfo image it was created from.

Now, try this command and you should see the output from hipInfo again:

sudo docker start -i nifty_hugle

The second Docker image we shall create will contain the sample binary called vector_copy. The source is in /opt/rocm/hsa/sample. As done with hipInfo, use make to build the binary. Note that this binary also depends on the files with the .brig extension to run.

We do the following before we build the image:

mkdir ~/tmp/my_rocm_vectorcopy

cd ~/tmp/my_rocm_vectorcopy

mkdir vector_copy

cp -p /usr/local/lib/libsupc++.so . # Do this if necessary

cd vector_copy

cp -p /opt/rocm/hsa/sample/vector_copy .

cp -p /opt/rocm/hsa/sample/vector_copy*.brig .

cd .. # Back to ~/tmp/my_rocm_vectorcopy

For our Dockerfile, we have this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/

RUN sudo mkdir /usr/local/vector_copy

COPY vector_copy/* /usr/local/vector_copy/

RUN sudo ldconfig

USER rocm-user

ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/vector_copy"

WORKDIR /usr/local/vector_copy

ENTRYPOINT ["vector_copy"]

Building the Docker image for vector_copy should be familiar by now.

As an exercise, run the Docker image to see what output you get. Try with or without --rm and with the 'docker start' command.

For our last example, we shall use a Docker container for the Caffe deep learning framework. We are going to use the HIP port of Caffe which can be targeted to both AMD ROCm and Nvidia CUDA devices.¹⁰ Converting CUDA code to portable C++ is enabled by HIP. For more information on HIP, see https://github.com/ROCm-Developer-Tools/HIP.

Let us pull the hip-caffe image from the Docker registry:

docker pull intuitionfabric/hip-caffe

Test the image by running a device query on the AMD GPUs:

sudo docker run --name my_caffe -it --device=/dev/kfd --rm intuitionfabric/hip-caffe ./build/tools/caffe device_query -gpu all

Note that your output will depend on your own host configuration.

In this article, we provided you with a guide on how to use AMD’s ROCm framework with Docker container technology. This should serve as a good jumpstart to begin your Deep Learning development using AMD’s platform.

Docker has become an essential technology in containing the complexity of Deep Learning development. Deep Learning frameworks and tools have many dependencies. Leveraging Docker to isolate these dependencies within a Linux container leads not only to greater reliability and robustness but also to greater agility and flexibility. Many frameworks and tools are emerging, and it is best practice to have a robust solution for managing the disparate parts. Docker containers have become a standard practice in Deep Learning and this technology is well supported by AMD’s ROCm framework.

Back in December 2016, we first announced our Radeon Instinct initiative, combining our strength in compute with our dedication to open software. We later announced our Radeon Vega Frontier Edition, an enabler of Radeon Instinct.

Today, we’re excited to tell you about the next chapter in our vision for instinctive computing. AMD’s Radeon Instinct™ accelerators will soon ship to our partners (including Boxx, Colfax, Exxact Corporation, Gigabyte, Inventec and Supermicro, among others) and power their deep learning and HPC solutions starting in Q3 2017.

Artificial intelligence and machine learning are changing the world in ways we never could have imagined only a few years ago, enabling life-changing breakthroughs that can solve previously unsolvable problems. Radeon Instinct™ MI25, MI8, and MI6, together with AMD’s open ROCm 1.6 software platform, can dramatically increase performance, efficiency, and ease of implementation, speeding through deep learning inference and training workloads. We’re not just looking to accelerate the drive to machine intelligence, but to power the next era of true heterogeneous compute.

New Radeon Instinct Accelerators

Through our Radeon Instinct server accelerator products and open ecosystem approach, we’re able to offer our customers cost-effective machine and deep learning training, edge-training and inference solutions, where workloads can take the most advantage of the GPU’s highly parallel computing capabilities.

We’ve also designed the three initial Radeon Instinct accelerators to address a wide range of machine intelligence applications, which includes data-centric HPC-class systems in academics, government labs, energy, life science, financial, automotive and other industries:

The Radeon Instinct™ MI25 accelerator, based on the new “Vega” GPU architecture with a 14nm FinFET process, will be the world’s ultimate training accelerator for large-scale machine intelligence and deep learning datacenter applications. The MI25 will deliver superior FP16 and FP32 performance in a passively-cooled single GPU server card with 24.6 TFLOPS of FP16 or 12.3 TFLOPS of FP32 peak performance through its 64 compute units (4,096 stream processors). With 16GB of ultra-high bandwidth HBM2 ECC GPU memory and up to 484 GB/s of memory bandwidth, the Radeon Instinct MI25’s design is optimized for massively parallel applications with large datasets for Machine Intelligence and HPC-class systems.

The Radeon Instinct™ MI8 accelerator, harnessing the high performance and energy efficiency of the “Fiji” GPU architecture, is a small form factor HPC and inference accelerator with 8.2 TFLOPS of peak FP16|FP32 performance at less than 175W board power and 4GB of High-Bandwidth Memory (HBM) on a 512-bit memory interface. The MI8 is well suited for machine learning inference and HPC applications.

The Radeon Instinct™ MI6 accelerator, based on the acclaimed “Polaris” GPU architecture, is a passively cooled inference accelerator with 5.7 TFLOPS of peak FP16|FP32 performance at 150W board power and 16GB of ultra-fast GDDR5 GPU memory on a 256-bit memory interface. The MI6 is a versatile accelerator ideal for HPC and machine learning inference and edge-training deployments.

Planned for a June 29th rollout, the ROCm 1.6 software platform, with performance improvements and new support for MIOpen 1.0, is scalable and fully open source, providing a flexible, powerful heterogeneous compute solution for a new class of hybrid Hyperscale and HPC-class systems. Comprising an open-source Linux® driver optimized for scalable multi-GPU computing, the ROCm software platform provides multiple programming models, the HIP CUDA conversion tool, and support for GPU acceleration using the Heterogeneous Compute Compiler (HCC).

The open-source MIOpen GPU-accelerated library, available June 29th with the ROCm platform, supports machine intelligence frameworks, with planned support for Caffe®, TensorFlow® and Torch®.

The Radeon Instinct MI25, combined with our new “Zen”-based Epyc servers and the revolutionary ROCm open software platform, will provide a progressive approach to open heterogeneous compute and machine learning from the metal forward.

We plan to ship Radeon Instinct products to our technology partners in Q3 for design in their deep learning and HPC solutions, giving customers a real choice of vendors for open, scale-out machine learning solutions.

For more details and specifications on these cards, please check out the product pages below.

1 PetaFLOPS of Performance for the Ultimate Virtualization and Machine Intelligence Solution

Today at Capsaicin SIGGRAPH, AMD showcased what can be achieved when the world’s greatest server CPU is combined with the world’s greatest GPU, based on AMD’s revolutionary “Vega” architecture. Developed by AMD in collaboration with Inventec, Project 47 is based on Inventec’s P-series massively parallel computing platform, and is a rack designed to excel in a range of tasks, from graphics virtualization to machine intelligence.

Project 47 boasts 1 PetaFLOPS of compute power at full 32-bit precision, delivering a stunning 30 GigaFLOPS/W and demonstrating dramatic compute efficiency.¹ It boasts more cores, threads, compute units, IO lanes and memory channels in use at one time than any other similarly configured system before it. The incredible performance-per-dollar and performance-per-watt of Project 47 make supercomputing a more affordable reality than ever before, whether for machine learning, virtualization or rendering.

Project 47 is made up of a rack of individual servers, each harnessing one EPYC™ 7601 processor to drive up to four “Vega”-based Radeon Instinct™ MI25 accelerators using 128 PCIe® lanes, in contrast to the costly dual-CPU and PLX switch setups typically needed on competing platforms in order to run four GPUs. With Project 47, AMD showcased the ease with which multiple servers can be daisy-chained, demonstrating a rack of 20 servers running 20 EPYC SoCs and 80 Radeon Instinct MI25 accelerators.
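The headline figure can be sanity-checked from the numbers given in these posts. The sketch below multiplies out the rack configuration using the 12.3 TFLOPS FP32 peak quoted for the MI25, then divides by the 30 GigaFLOPS/W efficiency to get an implied power draw. That last number is our own back-of-the-envelope estimate, not a published specification.

```python
# Rack configuration and per-card peak as quoted in the text.
servers = 20
gpus_per_server = 4
mi25_fp32_tflops = 12.3
efficiency_gflops_per_watt = 30

total_gpus = servers * gpus_per_server
total_tflops = total_gpus * mi25_fp32_tflops

# Implied power: aggregate GFLOPS divided by GFLOPS per watt, shown in kW.
# This is an estimate derived from the two quoted figures only.
implied_power_kw = (total_tflops * 1e3) / efficiency_gflops_per_watt / 1e3

print(f"GPUs in the rack: {total_gpus}")
print(f"Aggregate FP32 peak: {total_tflops:.0f} TFLOPS")
print(f"Implied power at 30 GFLOPS/W: {implied_power_kw:.1f} kW")
```

Eighty MI25 cards come to roughly 984 TFLOPS of FP32 peak, which is where the 1 PetaFLOPS figure comes from.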

To bring Project 47 to life, AMD worked closely with Samsung Electronics with respect to the HBM2 memory used across the “Vega”-based product lines including the Radeon Instinct MI25 accelerators. Samsung also provided high-performance NVMe SSD storage and high-speed DDR4 memory to enable the 1 PetaFLOPS of performance. AMD also collaborated with Mellanox Technologies, leveraging their InfiniBand solution to deliver 100Gb connectivity through the rack.

Project 47 is expected to be available from Inventec and their principal distributor AMAX in Q4 of this year.

Mark Hirsch is Corporate Vice President, Systems & Solutions for the Radeon Technologies Group at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

AMD’s newly released Vega architecture has several unique features that can be leveraged in Deep Learning training and inference workloads.

The first noteworthy feature is the capability to perform FP16 operations at twice the speed of FP32 and INT8 operations at four times the speed of FP32. This translates to a peak performance of 24 teraflops on FP16 and 48 trillion operations per second on INT8. Deep Learning workloads are known to work well with lower precision arithmetic. It is as if AMD architects were aware of this reality and designed Vega to exploit this characteristic. The second noteworthy feature of Vega is its new memory architecture that permits the addressability of up to 512GB of memory. The third benefit is favorable coupling with AMD’s Threadripper and EPYC lines of microprocessors.
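To make the precision trade-off concrete, here is a small sketch. It derives the FP16 and INT8 peaks from an FP32 baseline of 12 teraflops (implied by the 24 teraflops FP16 figure at a 2x rate), then rounds a few weights to half precision to show how little is lost. The weight values are hypothetical, and the half-precision rounding uses Python's standard struct module rather than any GPU library.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# FP32 baseline implied by the quoted FP16 peak; FP16 runs at 2x, INT8 at 4x.
fp32_tflops = 12.0
peak = {"FP32": fp32_tflops, "FP16": 2 * fp32_tflops, "INT8": 4 * fp32_tflops}

# Hypothetical weights, all within half precision's normal range.
weights = [0.1, -0.25, 0.7, 1.5]
max_rel_error = max(abs(to_fp16(w) - w) / abs(w) for w in weights)

print(peak)
print(f"Worst relative rounding error in FP16: {max_rel_error:.2e}")
```

For values in the normal range, rounding to half precision keeps the relative error at or below 2^-11 (about 0.05%), which gives a sense of why many networks tolerate it.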

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. The current revival of interest in all things “Artificial Intelligence” (AI) is driven by the spectacular results achieved with deep learning. There are other AI technologies like expert systems, semantic knowledge bases, logic programming and Bayesian systems. Most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has disproportionately been driven by deep learning progress.

When Google embarked on converting their natural language translation software into using deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening”:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google. Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, the old system had not improved by even half of seven points over the preceding four years. Google’s first deep learning implementation improved the score by seven points, and its 0.4 gain in user-perception scores roughly matched the aggregate gains of the old system over its entire lifetime.

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

ROCm

ROCm is software that supports High Performance Computing (HPC) workloads on AMD hardware. ROCm includes a C/C++ compiler called the Heterogeneous Compute Compiler (HCC). HCC is based on the open-source LLVM compiler infrastructure project. The HCC compiler supports the direct generation of the native Radeon GPU instruction set (known as GCN ISA). Targeting native GPU instructions is crucial to get maximum performance. All the libraries under ROCm support GCN ISA.

Included with the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. The HCC compiler is based on previous work in heterogeneous computing at the HSA foundation. The design allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

The diagram above depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GCN ISA assembler can also be used as a source for the GCN target [1].

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with HSA diagram). The application communicates with the ROCr driver that resides in user space in Linux. The ROCr driver uses a low latency mechanism (packet based AQL) to coordinate with the ROCk Kernel Driver.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD-based GPU devices.
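To give a feel for why HIP source looks so similar to CUDA, here is a toy sketch of the renaming idea only. It is not how the actual HIP tooling is implemented, and it ignores everything beyond identifier renaming (headers, kernel launch syntax, library calls). The `toy_hipify` function and its regular expression are our own illustration. The identifiers in the sample (cudaMalloc, cudaMemcpy, cudaFree and their hip counterparts) are real runtime API names.

```python
import re

# Match runtime identifiers that start with "cuda" followed by a capital
# letter, such as cudaMalloc or cudaMemcpyHostToDevice. This pattern is an
# assumption made for illustration, not the real tool's logic.
_CUDA_API = re.compile(r"\bcuda([A-Z]\w*)")


def toy_hipify(cuda_source):
    """Rename CUDA runtime identifiers to their hip-prefixed equivalents."""
    return _CUDA_API.sub(r"hip\1", cuda_source)


cuda_snippet = """\
cudaMalloc(&d_buf, size);
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
cudaFree(d_buf);
"""

print(toy_hipify(cuda_snippet))
```

Because so many runtime calls map one-to-one like this, much of a port can be mechanical, with the hand work going into the parts that do not map directly.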

The ROCm initiative provides the handcrafted libraries and assembly language tooling that will allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS, which is implemented from scratch with a HIP interface. AMD also provides an FFT library called rocFFT that is also written with HIP interfaces. MIOpen is a native library that is tuned for Deep Learning workloads; it is AMD’s alternative to Nvidia’s cuDNN library. This library includes Radeon GPU-specific optimizations.

hipCaffe

AMD currently has ported Caffe to run using the ROCm stack. You can try examples here. I ran some benchmarks found here and here is a chart of the results:

Caffe is run on unspecified GPU hardware.

I don’t know the specific hardware that was used in these benchmarks; however, this comparison does show that the performance improvement is quite significant compared to the alternatives. One thing to observe is that the speedup is most impressive with a complex network like GoogLeNet as compared to a simpler one like VGG. This is a reflection of the amount of hand-tuning that AMD has done on the MIOpen library.

Deep Learning Standard Virtual Machines

Deep learning frameworks like Caffe have internal computational graphs. These graphs specify the execution order of mathematical operations, similar to a dataflow. These frameworks use the graph to orchestrate its execution on groups of CPUs and GPUs. The execution is parallel and this is one reason why GPUs are ideal for this kind of computation. There are however plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

The current state of Deep Learning frameworks is similar to the fragmented state before the creation of common code generation backends like LLVM. In the chaotic good old days, every programming language had to re-invent its way of generating machine code. With the development of LLVM, many programming languages now share the same backend code. Several well-known examples of this are Ada, C#, Common Lisp, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. The frontend code only needs to parse and translate source code to an intermediate representation (IR).

Deep Learning frameworks will eventually need their own “IR”. The IR for Deep Learning is, of course, the computational graph. Frameworks like Caffe and TensorFlow are all merely convenient fronts to their internal graphs, which specify the execution order of mathematical operations and the orchestration of collections of CPUs and GPUs.
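A computational graph of this kind can be sketched in a few lines. The example below is a toy, not the internal representation of Caffe or TensorFlow: each node names an operation and its inputs, and Python's standard graphlib module (available from Python 3.9) supplies an execution order in which every node runs after the nodes it depends on. The node names and the tiny relu(x * w + b) network are made up for illustration.

```python
from graphlib import TopologicalSorter

# Each node maps to (operation, names of its input nodes).
# The string "input" marks values that are fed in from outside.
graph = {
    "x": ("input", []),
    "w": ("input", []),
    "b": ("input", []),
    "mul": (lambda a, c: a * c, ["x", "w"]),
    "add": (lambda a, c: a + c, ["mul", "b"]),
    "relu": (lambda a: max(a, 0.0), ["add"]),
}


def run(graph, feeds):
    """Evaluate every node in dependency order and return all node values."""
    dependencies = {name: deps for name, (_, deps) in graph.items()}
    values = dict(feeds)
    for name in TopologicalSorter(dependencies).static_order():
        op, deps = graph[name]
        if op == "input":
            continue  # already supplied through feeds
        values[name] = op(*(values[d] for d in deps))
    return values


result = run(graph, {"x": 2.0, "w": 3.0, "b": -1.0})
print(result["relu"])  # 5.0
```

Nodes with no dependency between them, such as the three inputs here, could run in any order or at the same time, which is the parallelism a framework exploits when it schedules the graph on GPUs.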

New research is exploring ways to optimize the computational graph in a way that goes beyond single-device optimization and towards more global multi-device optimization. NNVM is one such framework: it performs computation graph optimization using an intermediate representation. The goal is for NNVM optimizers to reduce memory and device allocation while preserving the original computational semantics.

A more recent development is the port of NNVM to support AMD GPUs. The NNVM compiler can compile to the TVM stack. The TVM stack is an end-to-end compilation stack that supports multiple backends. TVM compiles a high-level computation description written in the TVM frontend down to optimized native GPU code. It leverages an LLVM based code generator in TVM and LLVM’s ROCm capabilities. This new project can be found at: https://github.com/ROCmSoftwarePlatform/nnvm-rocm.

The NNVM and TVM stacks perform optimizations in a global manner across either the computational graph or an alternative declarative specification. Conventional DL frameworks, however, have code generation and execution intertwined with their code base, which makes optimization solutions less portable. Ideally, one would like to see a common standard, a DL virtual machine instruction set, to which the community can collectively contribute optimization routines. Open Neural Network eXchange (ONNX) is one such standard. ONNX is a project supported by Facebook and Microsoft, who are building support for Caffe2, PyTorch and Cognitive Toolkit. The recent TVM port reveals the potential of AMD support for a wider range of DL frameworks.

TVM transforms the computational graph by minimizing memory, optimizing data layout and fusing computational kernels. It is a reusable framework that is designed to support multiple hardware back-ends. NNVM provides a high-level intermediate representation that represents task scheduling and memory management. TVM is a low-level IR for optimizing computation. A proof of concept showed that the approach of optimizing low-level operations led to around a 35% improvement over hand-engineered kernels. This end-to-end optimization combined with AMD’s open sourced computational libraries like MIOpen is a very promising development.

Conclusion

There are many Deep Learning frameworks in existence today, each with its own strengths and weaknesses. The field is making good progress toward standardization that allows these frameworks to interoperate, by way of a common standard Deep Learning virtual machine. ONNX is one of the more recent of these standards.

In addition to standardization, global optimization of the computational graph found in Deep Learning frameworks is a means towards higher performance. The TVM framework and its integration with AMD’s LLVM based backend open up the opportunity for end-to-end optimization of not only AMD GPUs but also the combination of CPUs and GPUs.

The recent release of ROCm 1.6, which includes a cuDNN-like library called MIOpen and a port of the deep learning Caffe framework (the AMD version is called hipCaffe), has opened up the opportunity for running deep learning projects using AMD Radeon GPUs. In this article we demonstrate six projects that you can start using with AMD’s new hardware accelerators.

Most GPU-enabled deep learning frameworks rely on Nvidia’s CUDA and cuDNN libraries. AMD is, however, making an aggressive effort to port many deep learning frameworks such as Caffe, Torch, MXNet and TensorFlow to run on its hardware. Developers are now able to convert CUDA code to portable C++ code, thanks to AMD’s porting tools and libraries such as HIP.

University of Wyoming’s Evolving AI Lab has a project whose goal is to understand how deep neural networks (DNNs) work by synthesizing the preferred stimuli that highly activate particular neurons. A deep generator network (DGN) is used as a prior for the DNN being studied. This DGN outputs a synthetic image that is as similar as possible to real images from the ImageNet dataset.

Below are a few results from running the sample scripts in the project:

The Caffe project already has pre-trained models (e.g. VGG, ImageNet) that can be used as a starting point for developing other kinds of image classification.

5. Visual Development Environment

Fabrik is an open source application for building, visualizing and training deep learning models. Fabrik provides simple drag-and-drop tools to streamline deep learning development. The application currently supports importing, editing and exporting of Caffe based models. This is a convenient way to view and edit your models.

6. Model Conversion Tools

Finally, there are vastly more projects that have been developed in frameworks other than Caffe. For these projects, there are tools that can convert a model into one that is compatible with Caffe. This GitHub project https://github.com/ysh329/deep-learning-model-convertor provides a listing of conversion tools that convert one framework’s model into another’s.

In this article, we explore the many deep learning projects that you can now run using AMD Radeon Instinct hardware. We have included in this list projects that you can test out with minimal effort. There are other projects that have customized Caffe with custom elements like new kinds of layers and activation functions. These projects may require porting CUDA specific code using AMD’s HIP tooling. Aside from the projects explored here, you can find other projects in the Caffe Model Zoo (see: https://github.com/BVLC/caffe/wiki/Model-Zoo).

The smartest companies in the world are migrating their infrastructure to support this new paradigm. Daily, the press continues to report the amazing progress of AI. Furthermore, you hear about firms like Google and Microsoft changing their entire software DNA to move into AI. The reason for this massive migration is Deep Learning.

Deep Learning is supporting work by not only providing assistive capabilities, but also by enabling more creative generative capabilities. Assistive capabilities can happen in real time as well as in the backend. There are certain professions where the ability to curate and analyze information is extremely valuable. We can enhance these curation and analysis capabilities by reducing the deluge of information into smaller chunks that are more quickly digestible.

Generative capabilities are a new kind of capability that is becoming more pervasive. By now, we’ve all experienced the capabilities of the mobile app Prisma, which is able to re-render photographs in the style of different artists.

In this article, we highlighted several deep learning projects that explore both assistive and generative capabilities found in Deep Learning. We also covered some tools that allow you to port models from other projects as well as an IDE. Software that supports Radeon Instinct accelerators is still in its infancy. However, despite being out for just a few months, there are now plenty of interesting applications that can be used as a springboard to developing more complex solutions.

Albert J. De Vera and Carlos E. Perez are Co-Founders at Intuition Machine. They specialize in Deep Learning patterns, methodology and strategy. Many of their other writings on Artificial Intelligence can be found on Medium. Their postings are their own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

When a company starts using disruptive technology or a disruptive business model, the results can be spectacular and can leave the competition eating dust.

The reason for this is that although the company’s growth seems linear at first, it eventually reveals itself as being exponential. When a company reaches this point, it becomes very difficult, if not impossible, for competitors to catch up.

This article explores AMD’s open source deep learning strategy and explains the benefits of AMD’s ROCm initiative for accelerating deep learning development. It asks if AMD’s competitors need to be concerned with the disruptive nature of what AMD is doing.

On Deep Learning

Deep learning (DL) is a technology that is as revolutionary as the Internet and mobile computing that came before it. One author found it so revolutionary that he described it as “The Last Invention of Man” [KHAT] – strong words indeed!

Currently, the revival of interest in all things “Artificial Intelligence” (AI) is primarily due to the spectacular results achieved with deep learning research. I must however emphasize that this revival is not due to other classical AI technologies like expert systems, semantic knowledge bases, logic programming or Bayesian systems. Most of classical AI has not changed much, if at all, in the last 5 years. The recent quantum leap has been driven solely by deep learning successes.

For some perspective on the extent of deep learning development, look at this graph from Google that shows the adoption of deep learning technology in their applications:

As you can see, the adoption at Google has been exponential and the statistics are likely similar for many of the other big Internet firms like Facebook and Microsoft.

When Google embarked on converting their natural language translation software into using deep learning, they were surprised to discover major gains. This was best described in a recent article published in the New York Times, “The Great AI Awakening” [LEW]:

The neural system, on the English-French language pair, showed an improvement over the old system of seven points. Hughes told Schuster’s team they hadn’t had even half as strong an improvement in their own system in the last four years. To be sure this wasn’t some fluke in the metric, they also turned to their pool of human contractors to do a side-by-side comparison. The user-perception scores, in which sample sentences were graded from zero to six, showed an average improvement of 0.4 — roughly equivalent to the aggregate gains of the old system over its entire lifetime of development. In mid-March, Hughes sent his team an email. All projects on the old system were to be suspended immediately.

Let’s pause to recognize what happened at Google.

Since its inception, Google has used every type of AI or machine learning technology imaginable. In spite of this, the old system had not managed even half of a seven-point improvement in the preceding four years. In Google’s first implementation, DL delivered the full seven points.

The 0.4-point gain in user-perception scores was roughly equal to the old system’s gains over its entire lifetime of development!

Google likely has the most talented AI and algorithm developers on the planet. However, several years of handcrafted development could not hold a candle to a single initial deep learning implementation.

Deep Learning is unexpectedly, and disruptively, taking over the world

“The revolution in deep nets has been very profound, it definitely surprised me, even though I was sitting right there.”

Progress in deep learning has taken the academic community by storm. Two articles by practitioners of classical machine learning have summarized why they think DL is taking over the world. Chris Manning, a renowned expert in NLP, writes about the “Deep learning Tsunami” [MAN]:

Deep learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse.

It might be simply because deep learning on highly complex, hugely determined in terms of degrees of freedom graphs once endowed with massive amount of annotated data and unthinkable — until very recently — computing power can solve all computer vision problems. If this is the case, well it is simply a matter of time that industry (which seems to be already the case) takes over, research in computer vision becomes a marginal academic objective and the field follows the path of computer graphics (in terms of activity and volume of academic research).

Although I don’t want to detail the many deep learning developments of the past several years, Nell Watson provides a quick, short summary when she writes in “Artificial Intuition” [WAT]:

Make no mistake – Deep Learning is a “Disruptive” technology that is taking over operations of the most advanced technology companies in the world.

On Disruptiveness

Of late, the business world has become much more difficult and competitive. This situation has been made worse by disruptive changes in the global economy. The potential of nimbler competitors to disrupt the businesses of incumbents has never been greater. Peter Diamandis describes the Six D’s of Exponentials as consisting of the following:

Digitization – Anything that can be digitized can lead to the same exponential growth we find in computing. Anything that is digitized or virtualized instead is unencumbered by physical law. It thus costs less to mass produce and moves faster in spreading.

Deception – Once digitized or virtualized, initial growth deceptively appears linear. However, given time, exponential growth becomes obvious. For many it is too late to react once growth of a competitor hits this transition.

Disruption – New markets that are more effective and less costly are created. Existing markets that are tied to the physical world will eventually become extinct. We’ve seen this in music, photography and many other areas.

Demonetization – As cost heads towards zero, so does the ability to solicit a payment for it. Thus, a business has to reinvent its revenue model, or come up with new ways of monetization.

Dematerialization – Physical products disappear and are replaced by a more convenient and accessible alternative.

Democratization – More people now have access to technology at a lower cost. The means of production have become more accessible to everyone. This access is no longer confined to the big corporation, or the wealthy. We see this fragmentation everywhere where producers are publishing their own books, music and videos. This feeds back into itself and smaller players become able to compete.

To survive this disruption, there is an ever-pressing need for enterprises to take drastic action by re-engineering how they run their businesses.

John Hagel proposes four kinds of platforms [HAG] that leverage networking effects as an organizational mechanism to combat disruptive technologies. The four platforms that Hagel proposes are Aggregation platforms (example: Marketplaces), Social platforms (example: Social Networks), Mobilization platforms (example: Complex supply chains) and Learning platforms.

Learning platforms

Learning platforms are dynamic and adaptive environments where people come together to collectively learn how to address complex problems. Members can connect to ask questions, share experiences and offer advice. Open source projects that are actively managed with distributed source control, test-driven development, issue tracking, and continuous integration are a good example of a learning platform. The key ingredient here is that there is a learning mechanism that gets codified continuously. The fact that we find this in software development should not come as a surprise, as software development is essentially a learning process.

John Hagel describes an intriguing property of a Learning platform:

What if we change the assumption, though? What if each fax machine acquired more features and functions as it connected with more fax machines? What if its features multiplied at a faster rate as more fax machines joined the network? Now, we’d have a second level of network effect — we’d still have the network effects that come by simply increasing the number of fax machines, but now there’s an additional network effect that accrues as each fax machine adds more and more features as a result of interacting with other fax machines.

What Hagel is saying is that the members of the network adaptively become more effective and capable as a participant in the learning network. In other words, not only is there the conventional networking effect, but another mechanism kicks network effects into overdrive. Learning platforms such as an open source community can further accelerate the disruptiveness of an already disruptive technology.

Historically, an open source strategy has been quite effective in many disruptive technology areas. On the Internet, open source dominates: Linux (79%) in back-end infrastructure services, Google’s Chrome (58%) among browsers, Android (65%) among mobile operating systems, and Apache and Nginx (65%) among web servers. It should not surprise anyone if an open source strategy in the disruptive deep learning space eventually emerges as the dominant platform.

There are only a few semiconductor manufacturers that have the economies of scale to be competitive in high-performance computing. These are Nvidia, Intel, AMD, Qualcomm and Xilinx. We will now explore AMD’s deep learning solution and detail their unique open source strategy. We will also look at how it gives the company a competitive advantage.

Deep learning as a disruptive technology is critically enabled by hardware. AMD is one of the few semiconductor companies that actually exploits neural networks in its hardware. In AMD’s SenseMI Infinity Fabric, an evolution of AMD HyperTransport interconnect technology, the design uses “perceptrons” to support branch prediction. AMD’s GPU hardware has always been competitive against Nvidia hardware. When algorithms are extensively optimized, AMD hardware is in fact favored. This is shown in the many cryptocurrency proof-of-work algorithms that have favored AMD hardware. Raja Koduri, head of AMD Radeon products, recently noted that AMD has had more compute per buck since 2005.

AMD’s Open Source Deep Learning Stack

Before we get into the detail of AMD’s deep learning stack, let’s look at the philosophy behind the development tooling. AMD, having a unique position of being both a CPU and GPU vendor, has been promoting the concept of a Heterogeneous System Architecture (HSA) for a number of years. Unlike most development tools from other vendors, AMD’s tooling is designed to support both their x86 based CPU and their GPU. AMD shares the HSA design and implementations in the HSA foundation (founded in 2012), a non-profit organization that has members including other CPU vendors like ARM, Qualcomm and Samsung.

The HSA foundation has an informative graphic that illustrates the HSA stack:

As you can see, the middleware (i.e. HSA Runtime Infrastructure) provides an abstraction layer between the different kinds of compute devices that reside in a single system. One can think of this as a virtual machine that allows the same program to be run on both a CPU and a GPU.

In November 2015, AMD announced the ROCm initiative to support High Performance Computing (HPC) workloads, and to provide an alternative to Nvidia’s CUDA platform. The initiative released an open source 64-bit Linux driver (known as the ROCk Kernel Driver) and an extended (i.e. non-standard) HSA runtime (known as the ROCr Runtime). ROCm also inherits previous HSA innovations such as AQL packets, user-mode queues and context-switching.

ROCm also released a C/C++ compiler called the Heterogeneous Compute Compiler (HCC) targeted to support HPC applications. HCC is based on the open-source LLVM compiler infrastructure project [WIKI]. Many other languages have open source implementations that use LLVM. Some examples are Ada, C#, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. This rich ecosystem opens the possibility of alternative languages on the ROCm platform. One promising development of this kind is the Python implementation called Numba.

Added to the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. HCC supports other parallel programming APIs, but to avoid further confusion, I will not mention them here.

The HCC compiler is based on work at the HSA foundation. This allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.
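
To give a feel for what a source-to-source conversion of this kind involves, here is a toy sketch in Python. It is not AMD’s HIP tooling and does not reflect how that tooling is implemented; the `toy_hipify` function and the small mapping table are assumptions for illustration. The handful of CUDA runtime names and their HIP counterparts in the table are real, but the actual tools cover far more of the API and handle cases that simple renaming cannot.

```python
import re

# A small sample of CUDA runtime names and their HIP equivalents.
# The table is illustrative only and nowhere near complete.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Try longer names first so cudaMemcpyHostToDevice is never clipped to
# cudaMemcpy, and refuse to match inside a longer identifier.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(n) for n in _NAMES) + r")(?!\w)"
)


def toy_hipify(source: str) -> str:
    """Rename the known CUDA identifiers in `source` to their HIP names."""
    return PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""
print(toy_hipify(cuda_snippet))
```

The close one-to-one correspondence between the two sets of names is what makes largely automatic translation plausible; the parts that need a developer’s attention are the ones with no direct counterpart.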

AMD took the Caffe framework with 55,000 lines of optimized CUDA code and applied their HIP tooling. 99.6% of the 55,000 lines of code were translated automatically. The remaining code took a single developer a week to complete. Once ported, the HIP code performed as well as the original CUDA version.

HIP is not 100% compatible with CUDA, but it does provide a migration path for developers to support an alternative GPU platform. This is great for developers who already have a large CUDA code base.

Early this year AMD decided to get even “closer to the metal” by announcing the “Lightning Compiler Initiative.” The HCC compiler now supports the direct generation of the Radeon GPU instruction set (known as GCN ISA) instead of HSAIL.

As we shall see later, directly targeting native GPU instructions is critical to get higher performance. All the libraries under ROCm support GCN ISA.

The diagram depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. GCN ISA assembly source can also be used as input for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with the HSA diagram). The application communicates with the ROCr runtime that resides in user space in Linux. The ROCr runtime uses a low latency mechanism (packet based AQL) to coordinate with the ROCk Kernel Driver.

This raises two key points about what is required for high-performance computation:

1. The ability to perform work at the assembly language level of a device.

2. The availability of highly optimized libraries.

In 2015, Peter Warden wrote, “Why GEMM is at the heart of deep learning” [WAR] about the importance of optimized matrix libraries. BLAS (Basic Linear Algebra Subprograms) are hand-optimized libraries that trace their origins way back to Fortran code. Warden writes:

The Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.
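
The trade-off Warden describes can be seen in a small sketch. The Python below rewrites a 2-D convolution as a single matrix multiplication by first copying every image patch into a row of a matrix (commonly called im2col). The function names are assumptions for this illustration, the code is deliberately unoptimized, and, as in most deep learning frameworks, “convolution” here means sliding the kernel without flipping it.

```python
def matmul(a, b):
    """Plain triple-loop GEMM: (m x k) times (k x n)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def im2col(image, k):
    """Copy every k x k patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [[image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
            for r in range(h - k + 1) for c in range(w - k + 1)]


def conv2d_direct(image, kernel):
    """Reference: slide the kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(image[r + dr][c + dc] * kernel[dr][dc]
                 for dr in range(k) for dc in range(k))
             for c in range(w - k + 1)]
            for r in range(h - k + 1)]


def conv2d_gemm(image, kernel):
    """Same result, computed as one matrix multiplication."""
    k = len(kernel)
    out_w = len(image[0]) - k + 1
    patches = im2col(image, k)                      # (out_h * out_w) x (k * k)
    weights = [[v] for row in kernel for v in row]  # (k * k) x 1
    flat = [row[0] for row in matmul(patches, weights)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_gemm(image, kernel) == conv2d_direct(image, kernel))
```

For this 3 x 4 image the patch matrix holds 24 values, twice the 12 values in the image, because overlapping patches duplicate pixels. That duplication is the wasteful storage; the payoff is that all the arithmetic lands in one regular matrix multiplication that a tuned GEMM routine can execute efficiently.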

This kind of attention to every detailed memory access is hard to replicate despite our advances in compiler technology. Warden went even further in 2017 when he wrote, “Why Deep learning Needs Assembler Hackers” [WAR2]:

I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century.

Despite being a very recent technology, software that enables deep learning is a complex stack. A common perception is that most deep learning frameworks (e.g. TensorFlow, Torch, Caffe) are open source. These frameworks are, however, built on highly optimized kernels that are often proprietary. Developers can go to great lengths to squeeze every ounce of performance from their hardware.

I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to.

Gray used assembly language to write his kernels, thus creating algorithms that bested the proprietary alternatives. Now imagine how much less work he would have had to do if the assembly language were available and documented. This is what AMD is bringing to the table.

The ROCm initiative provides the handcrafted libraries and assembly language tooling that will allow developers to extract every ounce of performance from AMD hardware. This includes rocBLAS [KNOX], an implementation of BLAS that provides the following routines at each level:

BLAS Level-1:

amax, amin, asum, axpy, copy, dot, nrm2, scal, swap

BLAS Level-2:

gemv

BLAS Level-3:

gemm, trtri, batched-trtri
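
For readers unfamiliar with the BLAS levels, the sketch below shows one representative routine from each in plain Python: axpy (vector-vector), gemv (matrix-vector) and gemm (matrix-matrix). These are reference definitions of the mathematics only. They do not call rocBLAS, they return new lists rather than updating the output operand in place as BLAS routines do, and they omit options such as transposition and strides.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha * A x + beta * y."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha * A B + beta * C."""
    cols_b = list(zip(*b))  # columns of B
    return [[alpha * sum(p * q for p, q in zip(row, col)) + beta * cij
             for col, cij in zip(cols_b, c_row)]
            for row, c_row in zip(a, c)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
zeros = [[0, 0], [0, 0]]

print(axpy(2, [1, 2], [10, 20]))
print(gemv(1, A, [1, 1], 0, [0, 0]))
print(gemm(1, A, B, 0, zeros))
```

The amount of arithmetic per element of data grows with the level, which is why Level-3 routines such as gemm benefit the most from careful tuning for a particular GPU.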

This is implemented from scratch with a HIP interface. AMD has even provided a tool called Tensile that supports the benchmarking of rocBLAS. AMD also provides an FFT library called rocFFT that is likewise written with HIP interfaces.

I wonder if Facebook’s fbcunn (Deep learning extensions for CUDA) [GIT], a library that employs FFTs to accelerate convolutions, could be ported using the HIP tooling.

Deep learning algorithms continue to evolve at a rapid pace. In the beginning, frameworks exploited the available matrix multiplication libraries. These finely tuned algorithms have been developed over decades. As research continued, newer kinds of algorithms were proposed.

Thus came the need to go beyond generic matrix multiplication. Convolutional networks came along and this resulted in even more innovative algorithms. Today, many of these algorithms are crafted by hand using assembly language.

Here is a partial list of deep learning specific optimizations that are performed by a proprietary library:

These low-level tweaks can lead to remarkable performance improvements. For some operations (e.g. batch normalization), the performance increases 14 times compared to a non-optimized solution.

AMD is set to release a library called MIOpen that includes handcrafted optimizations. This library includes Radeon GPU-specific optimizations for operations and will likely include many of those described above. MIOpen is scheduled for release in the first half of this year. Its release will coincide with the release of AMD ports of popular deep learning frameworks such as Caffe, Torch7, and TensorFlow. This will allow application code that uses these frameworks to perform competitively on Radeon GPU hardware.

Many other state-of-the-art methods have not yet worked their way into proprietary deep learning libraries. New ones are proposed almost every day as papers are published on arXiv.

Here are just a few:

CReLU

PReLU

Hierarchical Softmax

Adaptive Softmax

Layer Normalization

Weight Normalization

Wasserstein Loss

Z-Loss
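
Several of the methods listed above are simple to state mathematically, even though making them fast on a GPU is the hard part. As a rough illustration, here are plain Python reference versions of PReLU, CReLU and layer normalization operating on a single vector. The function names and the simplifications (a single shared slope for PReLU, no learned gain or bias in layer normalization) are assumptions for this sketch, not code from any library.

```python
import math


def prelu(x, slope):
    """PReLU: identity for positive inputs, a learned slope for negative ones."""
    return [v if v > 0 else slope * v for v in x]


def crelu(x):
    """CReLU: concatenate ReLU(x) with ReLU(-x), doubling the output width."""
    return [max(v, 0.0) for v in x] + [max(-v, 0.0) for v in x]


def layer_norm(x, eps=1e-5):
    """Layer normalization: zero mean and unit variance across one vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


print(prelu([-2.0, 3.0], 0.25))
print(crelu([-1.0, 2.0]))
print(layer_norm([1.0, 2.0, 3.0]))
```

A developer with access to the compiler and assembly tooling can turn definitions like these into tuned kernels without waiting for a vendor library to catch up.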

It would be very difficult for any vendor to keep up with such a furious pace. In the current situation, given the lack of transparency in development tools, developers are forced to wait, although they would rather be performing the coding and optimizations themselves. Fortunately, the open source ROCm initiative solves the problem.

ROCm includes an open source GCN ISA based assembler and disassembler.

System Wide Optimization

In a recent investor’s meeting by Intel, the company shared some of their statistics:

Among servers used for deep learning applications, the chipmaker says that 91% use just Intel Xeon processors to handle the computations, 7% use Xeon processors paired with graphics processing units, while 2% use alternative architectures altogether.

The mix will change as the value of deep learning is understood better. The point here is that CPUs will always be required, even if most of the computations are performed by GPUs. That being said, it is important to recognize that system-wide optimizations are equally critical. This is where AMD’s original investments in Heterogeneous System Architecture may pay big dividends. I would however like to point out that new research efforts are underway to further optimize the code that is emitted by deep learning frameworks.

Deep learning frameworks like Caffe and TensorFlow have internal computational graphs. These graphs specify the execution order of mathematical operations, similar to a dataflow graph. The frameworks use the graph to orchestrate execution across groups of CPUs and GPUs. The execution is parallel, and this is one reason why GPUs are ideal for this kind of computation. There are, however, plenty of untapped opportunities to improve the orchestration between the CPU and GPU.

The current state of Deep Learning frameworks is similar to the state before the creation of a common code generation backend like LLVM. In the past, every programming language had its own way of generating machine code. With the development of LLVM, many languages now share the same backend code. The frontend code only needs to translate source code to an intermediate representation (IR). Deep Learning frameworks will eventually need a similar IR for Deep Learning solutions. The IR for Deep Learning is the computational graph.

New research is exploring ways to optimize the computational graph in a way that goes beyond just single device optimization and towards more global multi-device optimization.

An example of this is the research project XLA (Accelerated Linear Algebra) from the TensorFlow developers. XLA supports both Just in Time (JIT) and Ahead of Time (AOT) compilation. XLA is a high-level optimizer that performs its work by optimizing the interplay of the CPUs, GPUs and FPGAs.

The optimizations planned include:

Fusing of pipelined operations

Aggressive constant propagation

Reduction of storage buffers

Fusing of low-level operators
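
Operator fusion and the reduction of storage buffers go hand in hand, as the small Python sketch below tries to show. It is only an illustration of the idea, not XLA’s implementation; the function names and the scale, shift, ReLU pipeline are assumptions chosen for the example.

```python
def unfused(x, scale, shift):
    """Three separate passes; each one materializes a full intermediate list."""
    scaled = [v * scale for v in x]         # intermediate buffer 1
    shifted = [v + shift for v in scaled]   # intermediate buffer 2
    return [max(v, 0.0) for v in shifted]   # output buffer


def fused(x, scale, shift):
    """One pass applying all three operators per element, with no intermediates."""
    return [max(v * scale + shift, 0.0) for v in x]


data = [-1.0, 0.5, 2.0]
print(unfused(data, 2.0, -1.0))
print(fused(data, 2.0, -1.0))
```

Both versions compute the same values, but the fused one reads the input once and writes the output once. On a GPU, where memory bandwidth is often the limiting factor, avoiding the round trips for the intermediate buffers is where the benefit comes from.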

There are two other open source projects that are also exploring computational graph optimization. NNVM from the MXNet developers is a computation graph optimization framework that, similar to XLA, provides an intermediate representation. The goal is for optimizers to reduce memory and device allocation, while preserving the original computational semantics.

NGraph from Intel is exploring optimizations that include:

Kernel fusion

Buffer allocation

Training optimizations

Inference optimizations

Data layout

Distributed training

There are certainly plenty of ideas around about how to improve performance.

AMD has developed a runtime framework that takes into account heterogeneous CPU-GPU systems. It is called Asynchronous Task and Memory Interface (ATMI). The ATMI runtime is driven by a declarative description of high-level tasks, and it handles the scheduling and memory management for those tasks in an optimal manner.

ATMI is also open source and can be exploited to drive deep learning based computational graphs like the ones found in XLA, NNVM or NGraph. The future of Deep Learning software will revolve around a common computational graph and optimizations will take the orchestration of the entire system into consideration.

Operations and Virtualization

What we have been discussing so far are the opportunities to squeeze as much performance from hardware as possible, but there is more to a complete solution than just raw performance.

Every complex system requires good manageability to ensure continued and sustained operations. The ROCm initiative does not overlook this need and provides open source implementations. ROC-smi, ROCm-Docker and ROCm-profiler are three open source projects that provide support for operations.

AMD’s GPU hardware and drivers have also been designed to support GPU virtualization (see: MxGPU). This permits GPU hardware to be shared by multiple users. I will discuss operational aspects of AMD’s offerings in a future article.

Deployment

Throughout this article, we’ve discussed the promising aspects of the ROCm software stack. When the rubber meets the road, we need to discuss the kind of hardware that the software will run on. There are many different scenarios where it makes sense to deploy deep learning. Contrary to popular belief, not everything needs to reside in the cloud. Self-driving cars or universal translation devices need to operate without connectivity.

Deep learning also has two primary modes of operation – “training” and “inference”. In training mode, you would like to have the biggest, fastest GPUs on the planet, and you want many of them. In inference mode, you still want speed, but the emphasis is on economical power consumption. We don’t want to drive our businesses into the ground by paying for expensive power.

In summary, you want a variety of hardware that operates in different contexts. That’s where AMD is in a good position. AMD has recently announced some impressive hardware that’s geared toward deep learning workloads. The product line is called Radeon Instinct and it consists of several GPU cards: the MI6, MI8, and MI25. The number in each name roughly corresponds to the card’s peak throughput in teraflops. An MI6, for example, can perform roughly 6 trillion floating-point operations per second (6 teraflops).

The Radeon Instinct MI6, with a planned 16GB of GDDR5 memory, is a low-cost inference and training solution. The MI8, with 4GB of HBM, is designed primarily for inference-based workloads. The MI25 is designed for large training workloads and will be based on the soon-to-be-released Vega architecture. Shuttling data back and forth between GPU and CPU is one of the bottlenecks in training deep learning systems. Vega’s unique architecture, capable of addressing 512TB of memory, gives it a distinct advantage.

There’s also a lot more to say about GPU and CPU integration. I’ll briefly mention some points. On the server-side, AMD has partnered with Supermicro and Inventec to come up with some impressive hardware. At the top of the line, the Inventec K888 (dubbed “Falconwitch”) is a 400-teraflop 4U monster. By comparison, the Nvidia flagship DGX-1 3U server can muster a mere 170 teraflops.

There is also promise at the embedded device level. AMD already supplies custom CPU-GPU chips for Microsoft’s Xbox and Sony’s PlayStation. AMD APUs (i.e., CPUs with integrated GPUs) can also provide solutions for smaller form factor devices. The beauty of AMD’s strategy is that the same HSA-based architecture is available to the developer in the smallest of footprints as well as in the fastest servers. This breadth of hardware offerings gives deep learning developers a wealth of flexibility in deploying their solutions. Deep learning is progressing at breakneck speed, and one can never predict the best way to deploy a solution.

Conclusion

Deep learning is a disruptive technology like the Internet and mobile computing that came before. Open source software has been the dominant platform that has enabled these technologies.

AMD combines these powerful principles with its open source ROCm initiative. On its own, this has the potential to accelerate deep learning development. ROCm provides a comprehensive set of components that address high performance computing needs, such as tools that are closer to the metal. These include hand-tuned libraries and support for assembly language tooling.

Future deep learning software will demand even greater optimizations that span many kinds of computing cores. In my view, AMD’s strategic vision of investing heavily in heterogeneous system architectures gives their platform a distinct edge.

AMD’s open source strategy is uniquely positioned to disrupt and take the lead in future deep learning developments.

Carlos E. Perez is Co-Founder at Intuition Machine. He specializes in Deep Learning patterns, methodology and strategy. Many of his other writings on Artificial Intelligence can be found on Medium. His postings are his own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Today in San Francisco, California, AMD held a special event where we announced the newest additions to the Radeon Instinct™ family of compute products: the AMD Radeon Instinct™ MI60 and Radeon Instinct™ MI50. In step with the new hardware, the Radeon Open eCosystem (ROCm) has been updated with massive improvements in the device drivers, the compilers and supporting tools. The low-level math libraries, along with MIOpen, the machine intelligence library, have been optimized to really make deep learning applications sing.

ROCm is an open software platform for GPU-enabled HPC computing. It was created with developers in mind to accommodate future technologies including machine learning and artificial intelligence. As an open platform, the ROCm ecosystem provides a rich foundation of modern programming languages, designed to speed development of high-performance, energy-efficient heterogeneous computing systems.

We enabled AMD’s ROCm capable GPUs in the Linux ecosystem for easy deployment of deep learning applications in Linux distributions. The amdkfd device driver is now supported in the mainline kernel and this kernel is picked up by all the major distributions for their standard releases. Now we also support MI60 and MI50, based on the new Vega architecture, in the linux-next repository. For distributions not using the latest kernel, a DKMS build is still a viable option to add support for the MI60 and MI50 GPUs.

We have updated the LLVM-based clang compiler to support the new GPU architecture, including the new compute instructions targeted at accelerating machine learning computations. These low-level instructions implement compute operations all the way from single-bit precision to 64-bit floating point. The most beneficial instruction for accelerating deep learning training is an FP16 dot product that accumulates into a 32-bit result, maintaining the accuracy of the operation.
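Why accumulating into a 32-bit result matters can be shown with a small emulation. The sketch below is not GPU code and does not use the new instructions. It only rounds intermediate values to half or single precision using the standard struct module, and the helper names are illustrative assumptions. FP16 has an 11-bit significand, so above 2048 the representable values are 2 apart, and a running FP16 sum of ones stops growing once it reaches 2048.

```python
import struct


def round_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def round_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product of FP16 inputs with the running sum also kept in FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_fp16(acc + round_fp16(x * y))
    return acc


def dot_fp32_accumulate(a, b):
    """Dot product of FP16 inputs with the running sum kept in FP32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_fp32(acc + x * y)
    return acc


ones = [round_fp16(1.0)] * 4096
print(dot_fp16_accumulate(ones, ones))  # 2048.0: the sum stalls, 2048 + 1 rounds back to 2048
print(dot_fp32_accumulate(ones, ones))  # 4096.0: the exact answer
```

The inputs are identical in both cases. Only the precision of the accumulator differs, and the half-precision accumulator loses half of the result.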

Profiling and debugging tools also required updates to support the new hardware. These tools enable developers to get the most out of the GPU compute cycles and understand where the bottlenecks occur in their applications. Follow the development on our GitHub site.

Math libraries were customized with the hardware architecture in mind, resulting in a highly optimized solution. There are many different ways to optimize these math operations, and each specific matrix and convolution size needs to be tuned, so AMD built a tool to help automate the optimization process. This tool is called Tensile, and it is very useful for creating a library for GEMMs, GEMM-like problems (such as batched GEMM), N-dimensional tensor contractions, and anything else that multiplies two multi-dimensional objects together on a GPU. MIOpen also underwent massive optimizations and updates to realize the benefits of the foundational math libraries when integrated with deep learning frameworks.
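The kind of per-size tuning described above can be sketched as a tiny autotuner. This is an illustration of the general idea only. It does not reflect Tensile's interface or its search strategy, and the function names and candidate tile sizes are assumptions made for the example. It times a tiled matrix multiply for each candidate tile size and keeps the fastest one for that problem size.

```python
import time


def tiled_matmul(a, b, tile):
    """Multiply matrices a (n x k) and b (k x m) in tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(a, b, candidate_tiles):
    """Time each candidate tile size and return (fastest tile, all timings)."""
    timings = {}
    for tile in candidate_tiles:
        start = time.perf_counter()
        tiled_matmul(a, b, tile)
        timings[tile] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


size = 48
a = [[float(i + j) for j in range(size)] for i in range(size)]
b = [[float(i - j) for j in range(size)] for i in range(size)]
best_tile, timings = autotune(a, b, [4, 8, 16, 48])
print("fastest tile size for this run:", best_tile)
```

Timings in pure Python are noisy, so the winning tile can change from run to run. A production tuner would repeat each measurement and, as described above, store a separate choice for every matrix size it expects to see.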

One of the most exciting developments over the past year is the integration and progress with the machine learning frameworks. ROCm has been updated to support the TensorFlow framework API v1.11, and we are actively upstreaming the code into the main repository. Check out the TensorFlow GitHub to follow the updates, or see our GitHub page for PyTorch, Caffe2, Caffe and other framework developments.

To try out the newest packages, develop an application and easily deploy a ROCm solution, get the most recent Docker images here. Using them saves you the time of collecting all the libraries and building them specifically for your platform.

We are always looking for skilled developers excited to work in this rapidly changing field. Check out our job listings at amd.com.