__color__,__group__,ticket,summary,component,version,milestone,type,owner,status,created,_changetime,_description,_reporter
3,Active Tickets,2,No Bool and Char arrays with the CUDA backend,CUDA backend,0.8.1.0,,defect,chak,new,2010-07-18T12:41:11Z+0100,2010-10-13T05:41:52Z+0100,"The CUDA backend currently cannot handle arrays containing elements of type `Bool` or `Char`. In `D.A.A.Array.Data`, see the instance declarations for `ArrayElem Bool` and `ArrayElem Char` for details. A host-side workaround is sketched below.
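A possible user-level workaround until those instances exist (a sketch only; `encode' and `decode' are hypothetical helpers, and host-side conversion via `fromList'/`toList' is assumed) is to keep the data as `Int` on the device and convert on the host:
{{{
import qualified Data.Array.Accelerate as Acc

-- Encode Bools as 0/1 Ints before 'use'; decode on the host after 'run'.
encode :: [Bool] -> Acc.Acc (Acc.Vector Int)
encode bs = Acc.use $ Acc.fromList (length bs) (map fromEnum bs)

decode :: Acc.Vector Int -> [Bool]
decode = map (/= 0) . Acc.toList
}}}
",chak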
3,Active Tickets,18,Permute does not properly write-combine results,CUDA backend,0.8.1.0,,defect,,new,2010-07-24T11:16:09Z+0100,2010-10-13T05:43:13Z+0100,"When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The `permute` operation does not currently take this into account.
{{{
main :: IO ()
main = do
  putStr ""Interpreter : "" ; print (Interp.run accumulate)
  putStr ""CUDA : ""        ; print =<< (CUDA.run accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4 (repeat 0)
}}}
Which results in:
{{{
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA : Array 4 [1,1,1,1]
}}}
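For reference, the interpreter's answer matches a plain Haskell model of the intended combining semantics (a sketch; `reference' and `upd' are illustrative names, not library code):
{{{
-- Every source element must be combined into its target slot with (+),
-- so colliding writes all contribute rather than racing.
reference :: [Int]
reference = foldl upd (replicate 4 0) (zip idxs srcs)
  where
    srcs = replicate 16 1
    idxs = [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1]
    upd acc (i,v) = [ if j == i then a + v else a | (j,a) <- zip [0..] acc ]
-- reference == [4,6,3,3], matching the interpreter
}}}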
Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits. See attachment:permute_tag.inl
For devices of compute capability 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). However, I was unable to convince nvcc to reinterpret the bits of a float as an int (say), although in principle we should be able to do this. See attachment:permute_atomic.inl
",tmcdonell
3,Active Tickets,39,./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument,CUDA backend,0.8.1.0,,defect,tmcdonell,new,2010-12-14T12:17:43Z+0000,2010-12-15T10:32:50Z+0000,"Hello.
I was trying to build and use the current accelerate version
and ran into some problems.
I am using ghc-6.12.3 on Linux 2.6.32-bpo.5-amd64 (Debian 5.0.6)
with cudatoolkit_3.2.16_linux_64_ubuntu10.04 and gcc-4.5.1,
and my hardware is a GTX 295.
* cabal install accelerate
is trying to build accelerate-0.8.1.0
and it starts with building the dependency cuda-0.2.2
which fails with
{{{
c2hs: Errors during expansion of binding hooks:
./Foreign/CUDA/Driver/Context.chs:76: (column 15) [ERROR] >>> Unknown identifier!
Cannot find a definition for `cuCtxCreate' in the header file.
}}}
* cabal install cuda
builds cuda-0.3.2 and the installation runs through.
* cabal unpack accelerate
and manually remove the ""cuda < 0.3"" dependency
then cabal install runs through.
* cd accelerate-0.8.1.0/examples/simple ; make ; ./test
looks OK
* cd accelerate-0.8.1.0/examples/rasterize ; ghc --make rasterize
{{{
[1 of 2] Compiling RasterizeAcc ( RasterizeAcc.hs, RasterizeAcc.o )
RasterizeAcc.hs:26:9:
Not in scope: type constructor or class `NFData'
}}}
I manually add ""import Control.DeepSeq"" to RasterizeAcc.hs,
and then ""ghc --make"" succeeds.
./rasterize prints
4 * Haskell (pass), 4 * (Accelerate interpreted) pass
So I figure this isn't using CUDA at all.
I change Data.Array.Accelerate.Interpreter to
Data.Array.Accelerate.CUDA
in both RasterizeAcc.hs and rasterize.hs
Then ""ghc --make rasterize"" is OK,
but running ./rasterize gives
{{{
rasterize-test1.txt (Haskell) - pass
rasterize-test2.txt (Haskell) - pass
rasterize-test3.txt (Haskell) - pass
rasterize-test4.txt (Haskell) - pass
rasterize:
*** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument
}}}",j.waldmann
3,Active Tickets,38,Create user-defined data structure as instance of Elem,Accelerate language,,,enhancement,chak,new,2010-12-07T08:59:46Z+0000,2010-12-07T08:59:46Z+0000,"Hi, I'm trying to use accelerate to write a multipole solver with CUDA as the backend.
I wanted to create a user-defined data structure like:
{{{
data Panel b a
  = Panel { getBounds    :: b
          , getSrcBounds :: (Int, Int)
          -- , getFarExpan :: Array Int a
          }
  deriving (Show, Typeable)
}}}
But I found that it's hard to declare it as an instance of `Elem' because its methods are hidden.
I understand that a user can encode such a type using nested (,), as sketched below. But it would be nice to expose `Elem' to end users so that any type can be made an `Elem'. Is there any concern about this?
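For reference, the nested-tuple encoding might look like this (a sketch; `PanelRep', `toRep', and `fromRep' are hypothetical names):
{{{
-- Represent Panel by types that are already Elem instances,
-- and convert at the boundary.
type PanelRep b = (b, (Int, Int))

toRep :: Panel b a -> PanelRep b
toRep (Panel bs ss) = (bs, ss)

fromRep :: PanelRep b -> Panel b a
fromRep (bs, ss) = Panel bs ss
}}}
",fxie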
3,Active Tickets,32,OpenCL Backend,Accelerate language,,,feature request,chak,new,2010-08-23T08:43:46Z+0100,2010-08-24T12:45:15Z+0100,"Since OpenCL is a standard that works across graphics-hardware vendors, an OpenCL backend would make accelerate useful to a broader range of people.",anonymous
3,Active Tickets,37,"Support for ""combining"" operators",Accelerate language,0.8.0.0,,feature request,chak,new,2010-09-13T02:58:15Z+0100,2010-09-13T02:58:15Z+0100,"The ability to combine multiple arrays into a single one can currently only be achieved using zip/zipWith. Of course, these arrays can only combine 2 arrays at a time, so in general it would be useful to have operators for combining more than 2 arrays in more elaborate, but structured, ways.
Some requirements could be:
* 'combine' operator:
  * a generalised array combining operator
  * the sum of the input array sizes equals the output array size - i.e. no elements are lost or duplicated
  * no permutation is performed on elements within an input array - i.e. output array elements are contiguous with respect to their source input array
  * input array elements must all be of the same type and shape (shape would need to be a run-time check)
  * the combination does not need to preserve dimensionality - e.g. multiple 1D arrays could be combined to produce another 1D array (concatenation), a 2D array (stacked), or even ""maybe"" a 3D array (stack-stacking?)
* 'append' operator:
  * a specialised array combining operator
  * two input arrays - place one array at the ""end"" of the other
  * input array elements must be of the same type
  * the higher dimensions of the input arrays must have the same extent
In addition to fulfilling the need for a common pattern, combining operators would allow further backend optimisations:
* input arrays to a combine operator would not require intermediate writing to memory - they can be written directly (by their producer) to their location within the combined output array
* on architectures such as Fermi, the generation of the input arrays can be done in parallel using separate streams - 'combine' in this case acts as a synchronisation barrier until all computations are complete
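Possible signatures for the proposed operators (a sketch only, restricted to vectors for simplicity; neither is an implemented API):
{{{
append  :: Elem e => Acc (Vector e) -> Acc (Vector e) -> Acc (Vector e)
combine :: Elem e => [Acc (Vector e)] -> Acc (Vector e)

-- Reference semantics on plain lists: 'append' is (++) and 'combine'
-- is concat, so no element is lost, duplicated, or reordered relative
-- to its source array.
}}}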
",blever
3,Active Tickets,40,liftAcc in CUDA.Execute doesn't handle Let and Let2,Accelerate language,,,missing functionality,tmcdonell,new,2010-12-21T05:02:26Z+0000,2010-12-21T05:02:26Z+0000,"When recovering sharing of computations involving `scanl'`, `scanr'`, etc., whose results are used in the `size`, `shape`, or `(!)` functions, let-bindings may be encountered by `D.A.A.CUDA.Execute.liftAcc`, which panics because it currently doesn't handle the `Let` and `Let2` AST nodes.",chak
3,Active Tickets,34,CUDA backend does not support 'stencil',CUDA backend,0.8.1.0,0.9 release,missing functionality,tmcdonell,new,2010-08-28T05:57:53Z+0100,2010-12-21T05:15:36Z+0000,,tmcdonell
3,Active Tickets,41,Comprehensive support for standard Haskell classes and numeric conversions,Accelerate language,0.9.0.0,0.9 release,missing functionality,chak,new,2010-12-21T05:09:20Z+0000,2010-12-21T05:13:58Z+0000,,chak
4,Active Tickets,28,reduced performance of small types,CUDA backend,0.8.1.0,,defect,,new,2010-08-22T08:44:59Z+0100,2010-10-13T05:46:59Z+0100,"CUDA devices do not coalesce global-memory transfers of 8- and 16-bit types. Short of providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types, respectively), we may be able to promote these to 32-bit transactions and mask off the irrelevant data. Similar issues exist for shared-memory bank conflicts.",tmcdonell
4,Active Tickets,36,scan operations hang indefinitely on devices with Compute Capability 1.0,CUDA backend,0.8.1.0,,defect,,new,2010-09-11T03:54:56Z+0100,2010-10-13T05:47:42Z+0100,"{{{scan_intervals}}} calls {{{__syncthreads()}}} inside a for-loop and also calls another device function, {{{scan_block}}}, which contains several more {{{__syncthreads()}}} calls. As threads do not all exit the loop at the same time, scan operations hang indefinitely at {{{__syncthreads()}}} on devices with Compute Capability 1.0.",seanl
4,Active Tickets,42,FFI support to interface with existing CUDA code,Accelerate language,0.9.0.0,,defect,chak,new,2010-12-21T05:10:58Z+0000,2010-12-21T05:10:58Z+0000,,chak
4,Active Tickets,31,support concurrent kernel execution on Fermi architectures,CUDA backend,0.8.0.0,,enhancement,,new,2010-08-23T01:51:05Z+0100,2010-08-23T01:51:18Z+0100,,tmcdonell
4,Active Tickets,3,Unhelpful error when using Double on pre-1.3 CUDA devices,CUDA backend,0.8.1.0,,defect,None,new,2010-07-18T12:50:10Z+0100,2010-10-13T05:46:21Z+0100,"Only CUDA devices with compute capability 1.3 and up have hardware support for `Double`. Currently, Accelerate will nevertheless generate `Double` code for older devices, which leads to a failure further down the pipeline. This is not very user-friendly.
This raises the general question of how to handle device capabilities elegantly.",chak
3,Active Tickets,8,Sharing is lost,Accelerate language,0.8.1.0,0.9 release,missing functionality,chak,assigned,2010-07-18T13:32:46Z+0100,2010-12-21T05:14:55Z+0000,"Currently the frontend fails to preserve sharing in Accelerate expressions, which leads to unnecessary recomputation of shared values, as in the sketch below.
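A minimal example of lost sharing (a sketch; `lostSharing' is a hypothetical name, assuming `import Data.Array.Accelerate as Acc'):
{{{
-- 'xs' is used twice; without sharing recovery the 'Acc.map' term is
-- duplicated in the reified AST and computed twice by the backend.
lostSharing :: Acc (Vector Int) -> Acc (Vector Int)
lostSharing arr = Acc.zipWith (+) xs xs
  where
    xs = Acc.map (*2) arr
}}}
",chak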
4,Active Tickets,26,Internal error in filter test with criterion,CUDA backend,0.8.0.0,,defect,tmcdonell,assigned,2010-08-18T06:39:41Z+0100,2010-08-18T10:09:45Z+0100,"On my MBP with an NVIDIA !GeForce 9400M (256MB VRAM) and CUDA 3.1, I get
{{{
benchmarking filter/cuda
collecting 100 samples, 1 iterations each, in estimated 5.192780 s
test: *** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:45 (unhandled): CUDA Exception: invalid argument
}}}
The filter test runs fine in the validation phase. It only dies under criterion (probably because that exercises a wider range of inputs).",chak