The Khronos Group - a non-profit industry consortium that develops, publishes, and promotes open-standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.

The GPU doesn't "know" that "i" is a row and not a column; it is just a work-item index, and it is up to your kernel to interpret it correctly.

The matrix and vector are arrays of float4s. Each float4 (obviously) has 4 elements. The size of the array is determined by the host when it allocates the buffer (you didn't include that code so I don't know).

The host needs to tell the kernel how many elements to process. It can do this via the global work size, which is the "work_units_per_kernel".

I hope that helps.

Dear Dithermaster,

In my project the input to the kernel is an N(rows) x M(columns) float matrix.
N and M are passed as arguments to the kernel.

The input also contains a vector with N elements.
Now I have to compute the dot product of a specific column of the matrix with the vector.
In memory, rows are contiguous and columns obviously are not.

I can extract a column from the matrix with a loop.
Is there a faster way?

If you pass your matrix as a buffer of floats (not float4s) then you can access any element easily. With rows contiguous (row-major), where row_size is the number of columns:

float element = buffer[row * row_size + column];

The example you looked at is probably specific to 4x4 matrices; using a float4 makes good sense there, since it can help leverage vector architectures.

In terms of "the fastest way", there are two things to keep in mind to make this fast:
1. Coalesced memory access: It is very important that work items executing in parallel access nearby memory, preferably adjacent, for best memory bandwidth. For example, work item 0 accessing buffer[0] and work item 1 accessing buffer[1]. If you instead have it set up so that work item 0 accesses buffer[0] but work item 1 accesses buffer[row_size], then you will not get good performance (since the GPU may read 128 bits at a time and then discard much of it).
2. Shared local memory caching of data that will be used across many work items. Matrix multiply is often used as an example of this, since each data element gets seen by every row and every column. A naive algorithm will read each element many times, but an algorithm that caches in fast shared local memory can reduce that significantly. It makes the algorithm a bit more complicated, though. Study the numerous matrix multiply examples to get a good understanding of how to leverage the limited (sometimes 48 KB) shared local memory to best advantage.