
David Tarditi and Sidd Puri are doing some really cool work over in Microsoft Research. They've built a development technology, Accelerator, that "provides a high-level data-parallel programming model as a library that is available for all .Net programming
languages. The library translates the data-parallel operations on-the-fly to optimized GPU pixel shader code and API calls. Future versions will target multi-core cpus." Watch this video!

I am troubled by how many times Charles is the one trying to introduce groups to what other people are working on. It would be great to improve the communication that should be going on, while at the same time cutting down on the e-mail that is overwhelming you all there.

If you can improve communication and get all these ideas to work together, I see a real future with some of the things that may be possible in the next 10-15 years. Also, if Microsoft can learn from its mistakes, be more agile, release more CTPs (Community Technology Previews), and take the feedback from them, maybe Microsoft can get it great in the second version instead of the third.

Deferred evaluation is definitely a very interesting subject. The work being done with LINQ is along the same lines: the compiler generates an expression tree that can then be passed around as data and transformed before evaluation. I wonder if it's possible to take expression trees generated by LINQ and transform them into parallelisable computations. I suppose it really comes down to "map" and "reduce" functions in the end. Whilst you are largely limited to pure arithmetic operations on the GPU, the future of multi-cores could certainly widen the scope.
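The idea above can be sketched in miniature: hold a computation as a tree of data, "compile" it into an ordinary per-element function, then evaluate it with a parallel map. This is a hypothetical Python toy, not LINQ or Accelerator; `Node` and `compile_tree` are invented names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Node:
    op: str       # 'add' or 'mul'
    left: object  # Node, or the placeholder string 'x'
    right: object

def compile_tree(node):
    """Turn the expression tree (data) into an ordinary per-element function."""
    if node == 'x':
        return lambda x: x
    l, r = compile_tree(node.left), compile_tree(node.right)
    if node.op == 'add':
        return lambda x: l(x) + r(x)
    return lambda x: l(x) * r(x)

# (x * x) + x, held as data and transformable before evaluation
tree = Node('add', Node('mul', 'x', 'x'), 'x')
f = compile_tree(tree)

with ThreadPoolExecutor(max_workers=4) as pool:   # the "map" step
    result = list(pool.map(f, range(8)))
print(result)   # [0, 2, 6, 12, 20, 30, 42, 56]
```

The interesting part is that `tree` could be inspected and rewritten (fused, reordered, partitioned) before `compile_tree` ever runs, which is exactly the freedom deferred evaluation buys.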

Of course, I can't talk about abstract syntax trees without once again mentioning syntactic macros. It would be interesting to look at using syntactic macros to perform staged computation. I'm sure some of the parallelising of operations can be decided at compile time. That could make for even more performance increases, since you can take some weight off the JIT compiler.

Anyway... Awesome work and great video, Charles. BTW: Charles, you need to get a secondary job at MSR as "social glue"! We need to get all these academics down to the bar to mix their ideas.

Thanks for the video. It should be noted, however, that one can create an array "computation" type library today; it is merely a matter of syntax and abstracting away the loops from the end user of your functions. Whether it would take advantage of a multi-core setup is another thing entirely, so good work here.

The brief GPU discussion/explanation was also interesting. I'm curious, however, as to what the conversion routines between data-parallel arrays and regular arrays look like. Perhaps I should check out the SDK? Is much code shared in the SDK, or is it pretty much a "black box" approach?

I'd highly recommend that you download the bits and play with the Accelerator "platform". Research needs your input: you represent the real world, and you will find problems and needs that will help the team build what ultimately serves your purposes.

Data parallelism for big numerical problems is kind of obvious. I think the next challenge is bringing parallelism to regular business apps. For example, if I have a list of business objects and want to validate them all, or maybe check for changes against a web service, doing a simple "foreach" loop is dumb when I have 2 or more CPUs. Maybe one day we will have languages and compilers smart enough that we can just express "validate all these objects" and have them work out the most efficient way to do it...
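The "validate all these objects" wish above can already be approximated with a thread pool instead of a sequential foreach. A hedged Python sketch; the `Order` class and its validation rule are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    total: float

def validate(order):
    # Stand-in for a per-object check (or a call out to a web service).
    return order.total >= 0

orders = [Order(1, 10.0), Order(2, -5.0), Order(3, 99.9)]

# Validate all objects concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(validate, orders))

invalid = [o.id for o, ok in zip(orders, results) if not ok]
print(invalid)   # [2]
```

The pool version shines when each check blocks on I/O (the web-service case); for pure CPU work you would want processes or a smarter runtime, which is exactly the gap the comment is pointing at.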

I like the idea of having a library that makes general purpose computation on GPU really easy.

However, the example used in the video, the 5x5 convolve, is illusory, and it hints at the core problem with this approach to making parallelism easy to code in the wider world.

If you load a big (1600x1200) image into Photoshop and do a radius-5 Gaussian blur, you're looking at about 1 second of processing even on my relatively old PC (AMD Athlon XP 2000). That's because Adobe have hand-optimized their filter routines using the most efficient approach for a CPU: the code performs the whole matrix operation on a small area of the image at a time, it 'touches' areas of memory before it needs them so they'll be in the cache when it does need them, and of course it uses MMX/SSE to exploit the small amount of SIMD power current CPUs have.

The routine shown to us in the video takes a different approach: it composites the whole image repeatedly, offset by a given number of pixels each time. That's the definitive way to perform that operation on a GPU, but it's devastatingly inefficient on a CPU compared to the conventional way of doing it (as Adobe shows).
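The contrast being drawn here can be made concrete with a toy NumPy sketch (purely illustrative, and shrunk to a 3x3 box filter on a 6x6 image): the "GPU style" makes one full pass over the image per shift, while the "CPU style" gathers each output pixel's neighborhood in a single pass. Both compute the same thing; only the memory-access pattern differs.

```python
import numpy as np

img = np.arange(36, dtype=np.float32).reshape(6, 6)
K = 1  # 3x3 neighborhood to keep the sketch small

padded = np.pad(img, K)  # zero border so edges are well-defined

# "GPU style": 9 whole-image shifted composites, each a full pass over memory
shifted_sum = np.zeros_like(img)
for dy in range(-K, K + 1):
    for dx in range(-K, K + 1):
        shifted_sum += padded[K + dy:K + dy + 6, K + dx:K + dx + 6]

# "CPU style": one pass, gathering the neighborhood per output pixel
direct = np.zeros_like(img)
for y in range(6):
    for x in range(6):
        direct[y, x] = padded[y:y + 2 * K + 1, x:x + 2 * K + 1].sum()

assert np.allclose(shifted_sum, direct)  # same answer, different access order
```

On a GPU the first form maps naturally onto compositing passes; on a CPU it thrashes the cache with repeated whole-image traversals, which is precisely the complaint above.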

Now it was kind of glossed over in the video, but I believe the interviewees were saying that they are trying to come up with a way of making data-level parallelism easy to code for both GPUs and multi-core scenarios. They also touted their library approach as being a lot simpler than the conventional high-performance-computing approach of having a special compiler pick apart the loops in the problem and work out what to run where. They state that by encoding higher-level operations in library calls, the intention of the program is encoded, and the library then works out what to do.

The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU.

Now I suppose you could be really clever with your deferred computation, 'unpick' the intention from the series of compositing calls that the nested for loops in the example produce, and then work out a more efficient way to execute them on a CPU. But that's likely to work only in limited situations, where the same operation is executed over and over. I think it would be better to admit that there's no way to succinctly encode the intention of a program to a computer (a problem that mathematicians have grappled with since before there were computers) and just concentrate on producing useful libraries for the two different scenarios. But hey, you're the researchers!

I am troubled by how many times Charles is the one trying to introduce groups to what other people are working on. It would be great to improve the communication that should be going on, while at the same time cutting down on the e-mail that is overwhelming you all there.

I think the exact same thing every time. A lot of projects with overlapping goals. I know Microsoft is a big company, but a keyword searchable database of current/past projects might do wonders.

jvervoorn wrote: I am troubled by how many times Charles is the one trying to introduce groups to what other people are working on. It would be great to improve the communication that should be going on, while at the same time cutting down on the e-mail that is overwhelming you all there.

I think the exact same thing every time. A lot of projects with overlapping goals. I know Microsoft is a big company, but a keyword searchable database of current/past projects might do wonders.

I'm not sure I fully understand the problem here. Programming concurrent applications is hard and there is no single silver bullet to make it easy (it's a hard problem). Accelerator is but one approach to a specific subset of the problem, just as Software Transactional
Memory (video to appear on C9 next week) and language-level solutions are. I am just investigating what's being done around the company to address this important programming topic.

Can you elaborate more on what you see as the problem with this approach? I'm open to suggestions, as always. In fact, I'd love some more feedback.

C

Sorry, that must have come out wrong. I'm not complaining that there are videos covering similar topics.

I'm just surprised that projects with overlapping goals are unaware of each other. It seems to me that the CCR team and the Accelerator team might be able to share some useful information with each other. I think it's great that you are able to drop in the recommendation that they take a look at each other's solutions. Again, I'm just surprised it doesn't happen automatically.

However, and let this be my disclaimer, I'm not a Microsoft insider and further have no clue how these things work.

[...] It would be interesting to look at using syntactic macros to perform staged computation. I'm sure some of the parallelising of operations can be decided at compile time. That could make for even more performance increases, since you can take some weight off the JIT compiler.

Yes, staged computation is definitely an interesting way to go. As you point out, some of the work done by the library could be done at "compile time" (or at least earlier than it currently is).

rhm wrote: [...] The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU. [...]

Actually, if we computed all the intermediate arrays implied by the high-level code, performance would be disastrous on the GPU too, because you'd use way too much memory bandwidth and destroy the spatial locality.

All of the C# for-loops end up unrolled and you end up with one large expression graph being passed to the library. The graph would imply lots of intermediate arrays being computed.

We actually convert the graph to something of the following form:

1. For each output pixel of the convolution, execute a sequential piece of code.
2. The sequential piece of code fetches the neighboring pixels and adds them together.

The sequential piece of code corresponds to the body of the pixel shader. Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality. Fortunately, the GPU traverses the output pixels in a
reasonable order (these are 2-D images, after all).
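A toy version of the conversion described above: instead of materializing one intermediate array per shift (disastrous for bandwidth), the whole expression graph is fused into a single routine run once per output pixel, the analogue of the pixel shader body. The representation and names here are illustrative, not Accelerator's actual internals.

```python
import numpy as np

img = np.arange(16, dtype=np.float32).reshape(4, 4)
# The "graph": a 3x3 sum expressed as nine shifted copies of the image.
shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def shader_body(y, x, padded):
    # Sequential code run once per output pixel: fetch neighbors, add them.
    return sum(padded[y + 1 + dy, x + 1 + dx] for dy, dx in shifts)

padded = np.pad(img, 1)
out = np.empty_like(img)
for y in range(4):        # traversal order chosen to preserve spatial locality
    for x in range(4):
        out[y, x] = shader_body(y, x, padded)
```

No intermediate 4x4 array is ever created for any individual shift; each output element is produced by one fused sequential computation, which is the property the reply says makes the GPU version fast.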

Details of how we do this are described in our technical report (accessible from the Accelerator Wiki). The TR will soon be superseded by a paper appearing in ASPLOS '06 that we hope does a better job of describing the details.

You are correct that it is quite difficult to capture the "intention" of a programmer. Our point was simply that a good start would be to avoid over-specifying the behavior of the program, which is what happens if you write the code in C/C++ using for loops that specify the exact order in which individual array elements are accessed. One must wonder why Adobe had to hand-code the blocking that you describe and why a compiler couldn't do it. The answer, as you allude to, is that in the conventional high-performance computing approach, the compiler has to do some pretty heroic stuff.

To argue the other side, you could say that our approach results in a program that is too underspecified... the area in between overspecified and underspecified is the interesting area to investigate.

What happens for those lucky people with dual video cards? Can Accelerator use both in parallel?

(No I don't have dual cards, I just like the idea!)

No, Accelerator can't use both in parallel. We wish we could.

It's a neat idea, but it's a harder problem because it changes the hierarchy of the memory. You have to figure out how to partition the program across that hierarchy. With a single GPU, you are accessing the local memory on the graphics card, which is very
high-bandwidth (>50 GB/s). With multiple GPUs, unless you partition the problem just right, you may need to access memory on another card across the bus. PCI-Express is fast, but not nearly as fast as the memory on the graphics card.

Now, is this entire library written in managed code (i.e., on top of the .NET Framework)? If so, how can your code specifically target the GPU (as opposed to the CPU) without modifications to the underlying structure of the .NET Framework? Perhaps I am not familiar enough with the deep internal structure and design of the .NET Framework; if I were, I might not ask such a question. Wouldn't performance also suffer if the garbage collector were involved? Could this be written to execute even faster under something unmanaged like C or C++ (or any compiled language)?

I'm curious: why not implement this as a library on top of multi-core CPUs (which seems a much more useful scenario) rather than a GPU?

(Or perhaps you find the limited PS instruction set easier to start out with.)

Parallel data and parallel instructions are two different beasts, I guess. Trying to operate on a single dataset from multiple processors causes all kinds of memory/cache issues. When you can split the data up and work independently, it's fine. However, when you can't, the only performant way to operate is on one processor, in this case taking advantage of the data parallelism inside a single GPU. Of course, I'm not an expert by any means in this area... hopefully the boffins at MSR are finding clever solutions to these tricky problems.

rhm wrote: [...] The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU.
[...]

[...]Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality.[...]

You're quite right, rhm, that different target platforms have different issues, and that you have to adapt the structure of your program to your processor if you want the comparison to be meaningful. I can assure you that in our convolution benchmark, the
CPU version we compare against is quite clever about how it iterates.

For our multi-core backend, we are indeed being as ambitious as you suggest. Our goal is to tailor the loop ordering to suit the machine. There have been decades of research into automatic loop transformations (strip-mining, tiling, skewing, ...), so the
idea of doing this in a compiler isn't novel. As David points out, the advantage we have is that the program is specified at a higher level, so we don't have to burn cycles trying to figure out which transformations we can legally apply without breaking a
data dependency.

If I look at this video, I cannot prevent myself from thinking that there is nothing new here, and I am quite surprised to see that Microsoft is so late to this. Yes, Apple (now I am sure that many Windows fanboys will call me a Mac troll, but anyway!!!!) has been doing a lot of work on data parallelism for many years. I mean, Apple has been working on APIs for SIMD programming for many years that provide data parallelism for image processing, scientific applications, signal processing, math computing, etc. This API is called the Accelerate framework, and it just does all the job for the developer. No need to worry about which architecture your program will run on (PowerPC or Intel); the API does the optimisation for you, the vectorizing for you, and the architecture-dependent optimization for you. No need to worry about data alignment, or vector instructions, etc. It just provides the whole abstraction, and this is certainly why SIMD computing has been far more widespread on the Mac compared to Windows. On the PC you could use Intel's vectorizing tools, but that's expensive, and still the level of abstraction is not as high as a developer would like.

Now, talking about GPU processing, I cannot see anything impressive in this video. Apple (yes, again Apple, sorry!!) is already proposing TODAY (not as a research project) an object-oriented API for high-end applications and data-parallel computing. CoreImage and CoreVideo do just that. They provide an abstraction model for GPU programming; CoreImage uses OpenGL and the OpenGL Shading Language and works on programmable GPUs. Developers do not need to know how the GPU works or how OpenGL works; CoreImage and CoreVideo provide all the abstraction with an object-oriented programming model built with Cocoa. You don't need to know about graphics programming or computer-graphics mathematics either; CoreImage/Video abstracts all of that. Moreover, CoreImage/Video does the optimization on the fly for a given application, depending on the architecture the program runs on. It optimizes and scales performance depending on the resources you have. In other words, it optimizes for the GPU if the hardware allows it; otherwise it optimizes for AltiVec (SIMD computing) on the G4/G5 or for SSE on Intel. It will also optimize for multi-processor or multi-core machines if it needs to and can. CoreImage/Video also provides a set of built-in Image Units that perform general graphical effects: blur effects, distortion, morphology, you name it, all running on the GPU. CoreImage/Video uses a non-destructive mechanism and 32-bit floating-point numbers. The architecture is completely modular; any developer can build his own Image Unit. Anyone can download a test application named "FunHouse" in the Apple development tools that performs REAL TIME image processing using the GPU. Much more impressive compared to their demo, I would say. And more important, high-end applications like Motion and Final Cut Pro 5's Dynamic RT technology leverage CoreImage and CoreVideo: you get real-time graphics and video processing!!! So I don't really think that what is shown in this video is new or a breakthrough (sorry!!!!), particularly when it is still a research project while CoreImage and CoreVideo already do even more and have been available for more than a year now. I would really advise people interested in Accelerator to have a look at CoreImage and CoreVideo too; they will find a state-of-the-art GPU-based data-processing and data-parallelism technology. It's not the future, it's now....

Last point: there is something in the video I don't agree with. One of the guys said that scientific computing could be done on GPUs. I don't really think so, at least depending on your needs. I am a geophysicist, specialising in fluid modelling and continuum mechanics. In most (if not all) scientific modelling work, double-precision math is required to achieve acceptable precision in the results. The problem is that GPUs do not provide double-precision floating-point support in their execution units. They provide only (so far!!) single-precision math, as that is enough for 3D modelling and games. What I mean is that the vector units in the GPU (yes, GPUs use a SIMD model for their execution units; that's why they can achieve a high order of parallelism in data processing) only support single-precision floating-point numbers. This is not enough for most scientific applications today. There is a lot of research out there on how to use GPUs for non-graphical calculations involving large sets of data, but so far nothing really usable for scientific computing. Apple had a similar problem with AltiVec because it does not support double-precision floating-point vectors, which prevented the G4 from providing vector computing for double-precision floating-point numbers. Some of the Accelerate APIs can do some double-precision operations on AltiVec, but they are limited to specific operations like the double-precision Fourier transform. The GPUs therefore have a similar problem: they do not scale well for double-precision floating-point computing, which limits their use in scientific computing. On the other hand, this does not mean that interesting work cannot be done with GPUs outside of the graphics world. There are proposals to take advantage of GPU power to encode or decode MP3 files, MPEG-4 files, etc. Some ATI cards do H.264 decoding in hardware, but we could imagine using the GPU to also encode H.264. Another application is of course animation. Animation requires a lot of data-parallel computing, and GPUs can help a lot with that. Leopard's Core Animation is a good application of what can be done.

If I look at this video, I cannot prevent myself from thinking that there is nothing new here, and I am quite surprised to see that Microsoft is so late to this

I mean, Apple has been working on APIs for SIMD programming for many years that provide data parallelism for image processing, scientific applications, signal processing, math computing, etc. This API is called the Accelerate framework, and it just does all the job for the developer

Hakime -

The libraries that you mention are pre-compiled functions that use short-vector instruction sets (such as SSE3 or AltiVec). For example, they include a function that does convolution. In contrast, Accelerator provides you with primitive operations that are a level below a domain-specific library function. For example, you can do element-wise addition of two data-parallel arrays of one or two dimensions. These operations can be used to construct domain-specific library functions, such as the convolution function.
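To illustrate the distinction (a hedged Python sketch, not Accelerator's real API): `shift` and elementwise add/multiply stand in for the kind of low-level primitives described here, and a domain-specific 1-D convolution is then built on top of them.

```python
import numpy as np

def shift(a, k):
    """Primitive: copy of a shifted right by k (left if k < 0), zero-filled."""
    out = np.zeros_like(a)
    if k >= 0:
        out[k:] = a[:len(a) - k]
    else:
        out[:k] = a[-k:]
    return out

def convolve1d(a, kernel):
    """Domain-specific function composed from shift + elementwise mul/add."""
    acc = np.zeros_like(a)
    half = len(kernel) // 2
    for i, w in enumerate(kernel):
        acc = acc + w * shift(a, i - half)   # elementwise primitives only
    return acc

a = np.array([1.0, 2.0, 3.0, 4.0])
print(convolve1d(a, [1.0, 1.0, 1.0]))   # sliding 3-point sums: [3. 6. 9. 7.]
```

The library never ships a convolution; the user composes one from primitives, which is the separation of concerns the reply is describing.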

We have a paper available on our Wiki that describes in detail the kinds of primitives that Accelerator provides and the compilation approach that we use to generate reasonably efficient GPU code.

The point of Accelerator is to use data-parallelism to provide an easier way of programming GPUs and multi-cores, not to provide a set of domain-specific libraries.

You are correct that single-precision arithmetic will limit the use of GPUs for scientific computation. However, there are still lots of interesting things that you can do. You can look at
http://www.gpgpu.org for more information (under "categories", look at "scientific computation"). There has also been some recent work on emulating double-precision floating point numbers using single-precision floating
point numbers.
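The emulation trick mentioned here is often done with "double-single" arithmetic: one higher-precision value is carried as an unevaluated sum of two single-precision floats, using Knuth's two-sum to capture the rounding error exactly. A minimal sketch (NumPy float32 standing in for GPU single precision):

```python
import numpy as np

def two_sum(a, b):
    """Knuth's two-sum: return (s, e) with s = fl(a+b) and e the exact error."""
    a, b = np.float32(a), np.float32(b)
    s = np.float32(a + b)
    bb = np.float32(s - a)
    e = np.float32(np.float32(a - np.float32(s - bb)) + np.float32(b - bb))
    return s, e

# In float32 alone, 1e8 + 1 rounds back to 1e8 and the 1.0 is lost;
# the error term e recovers it, so the pair (s, e) keeps full precision.
s, e = two_sum(np.float32(1e8), np.float32(1.0))
print(s, e)
```

Chains of such pairs give roughly double the working precision at several times the operation count, which is why it is a stopgap rather than a replacement for hardware doubles.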

First, a question: what were all those programs being used? I think I saw Cygwin and Emacs, but what was the shell that would highlight the old commands on mouse-over?

And then a comment: my problem with offloading stuff to the GPU is that the numerical environment is a joke. You don't know anything about the radix, the range of fp values, whether +, -, *, /, and sqrt follow the sane rounding rules of IEEE, controlling any reorderings or fusions (i.e. a*b+c -> fma(a,b,c)) that are allowed to take place, NaNs, -0, Inf (if so, affine or projective?), what happens on overflow/underflow/etc., all the nice functions in the latest draft IEEE standard or C99, controlling directed rounding, etc., etc., etc.
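As a tiny illustration of why this matters (a Python/NumPy sketch, nothing GPU-specific): in single precision the same mathematical sum gives different answers depending on evaluation order, so hardware that silently reorders or fuses operations silently changes results.

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.5)

left = (a + b) + c    # cancel first, then add: 0.5 survives
right = a + (b + c)   # 0.5 is absorbed into -1e8 first and lost
print(left, right)    # 0.5 0.0
```

An environment that does not pin down rounding and evaluation order cannot even promise which of these two answers you get, which is the commenter's complaint.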

It's fine for doing things like CoreImage or accelerating game physics, but not for some of the things I'd love to offload that require careful analysis. I hope you guys nag the DX people (at least) for some pragmas or a mode that will tighten up the fp environment, and/or the ability to set/query anything interesting (see limits.h or float.h from C99 as an example).

And the use of functional-style stuff scares me. Do you guys automatically break the data down into smaller tiles to keep memory usage more manageable?


This is a rather old post, but I'm wondering where this project stands these days.

In light of projects like LINQ offering up expression trees that can now be interpreted and compiled into a completely different language and/or transferred off to be executed on a totally different piece of hardware, I'm kinda hoping this project picked up on that and basically implemented LINQ to GPUs.

I started wondering when we'd see this with specific respect to WPF and a true approach to writing custom shader effects when I realized that LINQ could enable this kind of capability.
I finally got around to writing a blog post about it and somebody alerted me to this project.

Super awesome! I'd just like to bump this thread and ask how this project is doing now. How is it related to the shader stuff in WPF that's coming up? Charles, a new interview with these guys would be so cool.

