
In this simple case, memory bandwidth is most probably the limiting factor.

Let's say for example you have memory with a peak bandwidth of 12 GB/s and an Intel i7 processor clocked at 2.40 GHz.

For each addition, the memory has to move 3 single-precision float values (2 reads and 1 write), that is 12 bytes.
Memory bandwidth thus limits the rate to 12 GB/s ÷ 12 bytes = 1 Goperation/s.

On the CPU side, let's suppose that the for loop has been unrolled, so that index incrementing and the conditional branch of the loop can be neglected.
The only instructions left per add operation are two loads from memory to registers, one addition, and one store from a register back to memory. Supposing that these accesses hit the L1 cache and neglecting latency, each of these instructions still costs a couple of cycles.

So a full add operation takes roughly 8 cycles, and the maximum rate is 2.40 GHz ÷ 8 = 0.3 Goperation/s.

(Of course, this is a very coarse estimate, but it is enough to highlight the origin of the limitation.)

So if four threads are computing concurrently, they can issue 1.2 Goperations/s, which already exceeds what the memory can sustain (1 Gop/s). The load/store instructions will stall, waiting for data to move to or from memory.


Thanks, this is very helpful.
As I am a beginner to this kind of computing, I was thinking the following. Please advise:

"This might be related to caching. In the beginning the processor doesn't have any information about the computation; then after a few computations the previous results are stored in the cache, and this cache is reused for future/parallel calculations, which decreases the time needed to retrieve the output.
Using the cache for recalculation and parallel computation decreases the time taken by the processor to recalculate if the output is already available. So, once it goes beyond 10, the time didn't continue to decrease."

Modern CPUs have hardware prefetchers that can detect sequential access patterns like this one and fill the cache before the computations need the data.
The problem here is that the computations are so simple and fast that memory cannot fill the cache quickly enough to sustain the rate at which the FP operations are executed.
