AnySL: Efficient and Portable Multi-Language Shading

Scenes rendered with AnySL and the ray tracers RTfact, Manta, and PBRT.
The surface shaders are written in RenderMan or scalar C++ and all compiled to a platform-independent intermediate representation.
The AnySL system loads, vectorizes, and optimizes the shaders and seamlessly integrates them into the renderer at runtime, achieving close to native shading performance.

About

In cooperation with the Computer Graphics Group, we are developing a unified shading system that is independent of source language, target architecture, and rendering engine without sacrificing runtime performance.

Our goal is to eventually provide a shading system that uses a portable shader format to allow integration into any kind of rendering engine (e.g. ray tracing, rasterization, global illumination).
Additionally, integrating an existing shading language requires only minimal effort, while the compiler technology of AnySL still enables maximum performance.

Shaders are program fragments that extend the functionality of a rendering system for specific tasks such as computing emission, light-material interaction, or geometry processing, similar to plug-ins used elsewhere.
The key difference from such function-call and library-based plug-ins is that shading code usually needs to be transformed to meet the target application's requirements on performance or program structure, while remaining convenient for the programmer to write.
However, to support a certain shading language, the renderer has to provide a compiler framework for it.
Hence, the renderer's implementor ends up investing a large part of their time in building compilers, something they did not want to do in the first place.

AnySL is a novel approach to ease the integration of a shading language into a renderer.
We compile shaders into a program representation that is independent of the shading language, the renderer, and the target hardware platform.
The renderer has to provide the implementation of the basic constructs of the shading language.
By augmenting the renderer with a just-in-time compiler library, the shaders are loaded and "glued" to the renderer's interface at runtime.
Afterwards, the shader is mapped to the underlying hardware platform.
With this approach, all performance obstacles incurred by common programming abstraction mechanisms are optimized away, resulting in high performance while retaining maximum flexibility.
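
The idea of a shader that is generic over the renderer's types and operations can be illustrated with a small C++ analogy. This is only a sketch: AnySL performs the instantiation at runtime on a portable intermediate representation through its embedded just-in-time compiler, whereas the example below uses compile-time templates. The names ScalarOps, PacketOps, and lambert are hypothetical and are not part of AnySL or any of the renderers mentioned here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar renderer: its native value type is one float.
struct ScalarOps {
    using Value = float;
    static Value constant(float c) { return c; }
    static Value mul(Value a, Value b) { return a * b; }
    static Value max(Value a, Value b) { return a > b ? a : b; }
};

// Hypothetical packet renderer: its native value type holds four lanes.
struct PacketOps {
    using Value = std::array<float, 4>;
    static Value constant(float c) { return {c, c, c, c}; }
    static Value mul(const Value& a, const Value& b) {
        Value r{};
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] * b[i];
        return r;
    }
    static Value max(const Value& a, const Value& b) {
        Value r{};
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] > b[i] ? a[i] : b[i];
        return r;
    }
};

// Generic "shader": every operator is a call into the renderer-supplied Ops,
// so the shader itself never names a concrete type or instruction.
// Computes albedo * max(nDotL, 0).
template <typename Ops>
typename Ops::Value lambert(const typename Ops::Value& albedo,
                            const typename Ops::Value& nDotL) {
    return Ops::mul(albedo, Ops::max(nDotL, Ops::constant(0.0f)));
}
```

Calling lambert<ScalarOps>(0.5f, 0.5f) shades a single sample, while lambert<PacketOps>(...) shades four samples with the same shader source. Once the calls are inlined, the indirection through Ops disappears, which mirrors how the abstraction overhead is meant to be optimized away.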

The AnySL Shading System uses an embedded just-in-time compiler (the "Low-Level Virtual Machine" (LLVM)) to load, specialize and optimize shaders at runtime. This allows us to recompile on the fly, e.g. after modifications of shader parameters, without sacrificing performance.

RTfact/AnySL

Manta/AnySL

PBRT/AnySL

Whole-Function Vectorization

For ray tracing engines that employ packet tracing, the scalar shader code is automatically transformed into packet code that operates on packets of data whose size depends on the target architecture's SIMD width.
This makes it possible to exploit the SIMD instruction sets of CPUs (e.g. SSE, AltiVec) without putting the burden of writing such complex and error-prone code on the shader programmer.
The only alternative is sequential shading of all rays of a packet, which incurs a lot of overhead if the ray tracer operates on SIMD datatypes, because packets have to be split before execution and the results have to be merged again.
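
The difference between sequential shading and packet code can be sketched in plain C++. The example is a hand-written illustration under stated assumptions, not output of the actual transformation: the real system works on LLVM intermediate code and emits SIMD instructions, whereas here each lane-wise loop merely stands in for one SIMD instruction. The shader, its constants, the assumed width of 4, and all function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 4;  // assumed SIMD width (four floats, as with SSE)
using Packet = std::array<float, kWidth>;
using Mask = std::array<bool, kWidth>;

// Scalar shader as the programmer writes it, with a data-dependent branch.
float shadeScalar(float nDotL) {
    if (nDotL > 0.0f)
        return 0.25f + 0.5f * nDotL;  // lit: ambient plus diffuse term
    return 0.25f;                     // facing away: ambient term only
}

// Sequential shading: split the packet, shade ray by ray, merge the results.
Packet shadeSequential(const Packet& nDotL) {
    Packet out{};
    for (std::size_t i = 0; i < kWidth; ++i) out[i] = shadeScalar(nDotL[i]);
    return out;
}

// Lane-wise helpers; each loop stands in for a single SIMD instruction.
Mask greaterThan(const Packet& a, float b) {
    Mask m{};
    for (std::size_t i = 0; i < kWidth; ++i) m[i] = a[i] > b;
    return m;
}
Packet mulAdd(float scale, const Packet& x, float offset) {
    Packet r{};
    for (std::size_t i = 0; i < kWidth; ++i) r[i] = offset + scale * x[i];
    return r;
}
Packet splat(float v) {
    Packet r{};
    r.fill(v);
    return r;
}
Packet blend(const Mask& m, const Packet& ifTrue, const Packet& ifFalse) {
    Packet r{};
    for (std::size_t i = 0; i < kWidth; ++i) r[i] = m[i] ? ifTrue[i] : ifFalse[i];
    return r;
}

// Packet version of shadeScalar: the branch becomes a mask, both paths are
// evaluated for all lanes, and a blend picks the right result per lane.
Packet shadePacket(const Packet& nDotL) {
    const Mask lit = greaterThan(nDotL, 0.0f);
    const Packet thenValue = mulAdd(0.5f, nDotL, 0.25f);
    const Packet elseValue = splat(0.25f);
    return blend(lit, thenValue, elseValue);
}
```

Both shadeSequential and shadePacket return the same values for every lane. The packet version avoids splitting and merging, at the cost of evaluating both sides of the branch for all lanes.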

Compared to sequential shading, we obtain an average speedup factor of 3.9 for the entire rendering process in RTfact.
At the same time, we reach over 90% of the performance of the hand-written, native shaders.

LLVM PTX Backend

As part of the AnySL system we implemented an LLVM backend for NVIDIA's "Parallel Thread Execution" (PTX) assembly language.
PTX is the low-level representation fed to NVIDIA GPGPU graphics drivers and is usually generated by compilers for the "Compute Unified Device Architecture" (CUDA).

The backend is similar to LLVM's C-backend and generates .ptx files directly from LLVM's intermediate representation (IR).

The backend already supports most of the PTX features:

simple arithmetic (add, mul, ...)

control flow

structs and arrays

simple function calls (no recursion, no struct returns)

global, shared, constant, and texture memory access

mathematical functions (e.g. sin, cos, sqrt, pow, ...)

special registers (e.g. thread_id)

There are no intrinsics for PTX-specific functionality such as texture fetches;
these are currently only accessible via external functions.
Atomic and synchronization instructions are not yet implemented but should
work the same way.

Performance has not yet been optimized extensively.
Optimizations that reduce register pressure are needed to generate faster code.

The backend was written as part of the bachelor's thesis of Helge Rhodin.
The source code is released under the University of Illinois/NCSA Open Source License (BSD-style) and is hosted at SourceForge.

Publications

Conferences

@CONFERENCE{KH:2011:cgo,
author = {Ralf Karrenberg and Sebastian Hack},
title = {{W}hole {F}unction {V}ectorization},
booktitle = {International Symposium on Code Generation and Optimization},
series = {CGO},
year = {2011},
doi = {10.1109/CGO.2011.5764682},
abstract = {
Data-parallel programming languages are an important component
in today's parallel computing landscape. Among those are
domain-specific languages like shading languages in graphics (HLSL, GLSL,
RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL.
Current implementations of those languages on CPUs solely rely on
multi-threading to implement parallelism and ignore the additional intra-core
parallelism provided by the SIMD instruction set of those processors
(like Intel's SSE and the upcoming AVX or Larrabee instruction sets).
In this paper, we discuss several aspects of implementing data-parallel
languages on machines with SIMD instruction sets. Our main contribution
is a language- and platform-independent code transformation that
performs whole-function vectorization on low-level intermediate code
given by a control flow graph in SSA form.
We evaluate our technique in two scenarios: First, incorporated in a
compiler for a domain-specific language used in real-time ray tracing.
Second, in a stand-alone OpenCL driver. We observe average speedup
factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for
different OpenCL kernels.
},
webslides = {http://www.cdl.uni-saarland.de/projects/wfv/wfv_cgo11_slides.pdf},
url = {http://www.cdl.uni-saarland.de/papers/karrenberg_wfv.pdf},
acc_rate = {26.7},
accepted = {28},
submitted = {105},
}

@CONFERENCE{KRSH:2010:hpg,
author = {Ralf Karrenberg and Dmitri Rubinstein and Philipp Slusallek and Sebastian Hack},
title = {{AnySL: Efficient and Portable Shading for Ray Tracing}},
booktitle = {Proceedings of the Conference on High Performance Graphics},
series = {HPG '10},
year = {2010},
location = {Saarbr{\"u}cken, Germany},
pages = {97--105},
numpages = {9},
url = {http://portal.acm.org/citation.cfm?id=1921479.1921495},
acmid = {1921495},
publisher = {Eurographics Association},
address = {Aire-la-Ville, Switzerland},
booktitle_short = {HPG},
abstract = {
While a number of different shading languages have been developed,
their efficient integration into an existing renderer is notoriously
difficult, often boiling down to implementing an entire compiler
toolchain for each language. Furthermore, no shading language is
broadly supported across the variety of rendering systems.
AnySL attacks this issue from multiple directions: We compile shaders
from different languages into a common, portable representation, which
uses subroutine threaded code: Every language operator is translated to
a function call. Thus, the compiled shader is generic with respect to
the used types and operators.
The key component of our system is an embedded compiler that
instantiates this generic code in terms of the renderer's native types
and operations. It allows for flexible code transformations to match
the internal structure of the renderer and eliminates all overhead due
to the subroutine threaded code. For SIMD architectures we
automatically perform vectorization of scalar shaders which speeds up
rendering by a factor of 3.9 on average on SSE. The results are highly
optimized, parallel shaders that operate directly on the internal data
structures of a renderer. We show that both traditional shading
languages such as RenderMan, but also C/C++-based shading languages,
can be fully supported and deliver high performance across different
CPU renderers.
},
webslides = {http://www.cdl.uni-saarland.de/projects/anysl/anysl_hpg10_slides.pdf}
}

MSc Thesis

@MASTERSTHESIS{Karrenberg:2009:MSc,
author = {Ralf Karrenberg},
title = {{Automatic Packetization}},
school = {Saarland University},
year = {2009},
month = {July},
webpdf = {http://www.cdl.uni-saarland.de/publications/theses/karrenberg_msc.pdf},
abstract = {
Modern processor architectures provide the possibility to execute an
instruction on multiple values at once. So-called SIMD (Single
Instruction, Multiple Data) instructions work on packets (or vectors)
of data instead of scalar values. They offer a significant performance
boost for data-parallel algorithms that perform the same operations on
large amounts of data, e.g. data encoding and decoding, image
processing, or ray tracing.
However, the performance gain comes at a price: programming languages
provide no elegant means to exploit SIMD instruction sets. Packet
operations have to be coded by hand, which is complicated, unintuitive,
and error prone. Thus, packetization - the transformation of scalar
code to packet form - is mostly applied automatically by local compiler
optimizations (e.g. during loop vectorization) or with a lot of manual
effort at performance-critical parts of a system.
This thesis describes an algorithm for automatic packetization that
allows a programmer to write scalar functions but use them on packets
of data. A compiler pass automatically transforms those functions to
work on packets of the target-architecture's SIMD width. The resulting
packetized function computes the same results as multiple executions of
the scalar code.
The algorithm is implemented in a source-language and
target-architecture independent intermediate representation (the Low
Level Virtual Machine (LLVM)), which enables its use in many different
environments. The performance of the generated code is shown in a
real-world case study in the context of real-time ray tracing: serial shader
code written in C++ is automatically specialized, optimized, and
packetized at runtime. The packetized shaders outperform their scalar
counterparts by an average factor of 3.6 on a standard SSE architecture
of SIMD width 4.
}
}