The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


Multiple access to global memory

Dear all,

In my code, several threads need to read a large number of variables from the same addresses in global memory. Unfortunately, the variables are too large to all fit in local memory. As a consequence, reading them takes 80% of the run time, even though the reads account for less than 5% of the instructions.
Can anyone suggest a way to speed up the access to these shared variables?

(My procedure is somewhat similar to the multiplication of two matrices.)

Re: Multiple access to global memory

Optimization is very specific to the hardware you're targeting, and also to the problem. Without much more detail you're only going to get vague answers. Some of the things that are generally a good idea on a GPU when accessing global memory:

Ensure that memory accesses are coalesced. That means that each thread should access memory that immediately follows that of the previous thread.

If the GPU doesn't have an L1 cache (e.g. NVIDIA prior to Fermi), copy a chunk of data into shared memory (called local memory in OpenCL) and then work on it before loading the next chunk. This is particularly useful if the data is reused, as occurs in matrix multiplication.

Put the data in an image and access it through a sampler.
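To make the chunking suggestion more concrete, here is a minimal sketch in plain C that emulates the access pattern on the CPU. It is not an OpenCL kernel and not necessarily how your code should look: `TILE`, `load_global`, `matmul_naive` and `matmul_tiled` are hypothetical names made up for illustration, the two small arrays stand in for `__local` buffers, and the loops over `ly`/`lx` stand in for the work-items of one work-group. It only counts how many global loads each scheme issues.

```c
#include <assert.h>

#define TILE 4  /* work-group edge length; hypothetical value for illustration */

static long global_reads;  /* counts simulated global-memory loads */

/* Stand-in for one global-memory load from an n x n row-major matrix. */
static float load_global(const float *m, int n, int row, int col) {
    global_reads++;
    return m[row * n + col];
}

/* Naive scheme: every output element re-reads a full row of a and a full
 * column of b from global memory. Returns the number of global loads. */
long matmul_naive(const float *a, const float *b, float *c, int n) {
    global_reads = 0;
    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += load_global(a, n, row, k) * load_global(b, n, k, col);
            c[row * n + col] = acc;
        }
    }
    return global_reads;
}

/* Tiled scheme: each TILE x TILE "work-group" stages one tile of a and one
 * tile of b, then every "work-item" in the group reuses the staged data.
 * Assumes n is a multiple of TILE. Returns the number of global loads. */
long matmul_tiled(const float *a, const float *b, float *c, int n) {
    assert(n % TILE == 0);
    global_reads = 0;
    for (int gy = 0; gy < n; gy += TILE) {
        for (int gx = 0; gx < n; gx += TILE) {
            float acc[TILE][TILE] = {{0.0f}};
            for (int t = 0; t < n; t += TILE) {
                /* These would be __local arrays in a real kernel. */
                float tile_a[TILE][TILE], tile_b[TILE][TILE];
                /* Cooperative load: work-item (ly, lx) fetches one element
                 * of each tile. Neighbouring lx values read neighbouring
                 * addresses, which is the coalesced pattern. */
                for (int ly = 0; ly < TILE; ly++) {
                    for (int lx = 0; lx < TILE; lx++) {
                        tile_a[ly][lx] = load_global(a, n, gy + ly, t + lx);
                        tile_b[ly][lx] = load_global(b, n, t + ly, gx + lx);
                    }
                }
                /* A real kernel needs barrier(CLK_LOCAL_MEM_FENCE) here. */
                for (int ly = 0; ly < TILE; ly++)
                    for (int lx = 0; lx < TILE; lx++)
                        for (int k = 0; k < TILE; k++)
                            acc[ly][lx] += tile_a[ly][k] * tile_b[k][lx];
                /* ...and a second barrier before the tiles are overwritten. */
            }
            for (int ly = 0; ly < TILE; ly++)
                for (int lx = 0; lx < TILE; lx++)
                    c[(gy + ly) * n + gx + lx] = acc[ly][lx];
        }
    }
    return global_reads;
}
```

In this model the naive version issues 2·n³ global loads and the tiled version 2·n³/TILE, so for n = 8 and TILE = 4 that is 1024 versus 256. The point is that the data never has to fit in local memory all at once: only two tiles are resident at any time, and each element loaded is reused TILE times. How much of that translates into real speed-up depends on the hardware.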

Depending on how similar your operation is to matrix multiply, try reading some of the papers on it, e.g. search for "Volkov matrix multiply" or "Nakasato matrix multiply".