Andy "Krazy" Glew is a computer architect, a long time poster on comp.arch ... and an evangelist of collaboration tools such as wikis, calendars, blogs, etc. Plus an occasional commentator on politics, taxes, and policy. Particularly the politics of multi-ethnic societies such as Quebec, my birthplace.

The content of this blog is my personal opinion. It is not that of my employer. See Disclaimer.

Photo credit: http://docs.google.com/View?id=dcxddbtr_23cg5thdfj

Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

If the arithmetic is 2's compliment, signed or unsigned, without overflow, this can be evaluated in any order. Indeed, one need not perform carry propagating adds for the products; we can combine the sum of the products with the sums involved in multiplication, in a redundant, carry-free, form.

However, if we are to detect overflow, or equivalently perform saturation, issues are raised. How are the calculations associated:
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))

Should we detect intermediate overflow, or should we ignore overflows that are "corrected" by subsequently adding in numbers of opposite sign. I.e. should we perform calculations as if using extended precision intermediate accumulators or temporary values.

If saturating, e.g. if using leftmost association, when we saturate for overflow should we stay saturated, or should we allow the saturated value to be brought back into range? Etc.

All of these boundary cases are annoying.

Glew recommendation: for small vectors, either don't do overflow detection, or calculate as if in extended precision and detect at the overflow at the end.
* Pro: easy to describe and understand
* Con: gives different results compared to doing scalar calculations in straightforward loops (with overflow detection)

Consider calculations as if in a specific order, typically leftmost or tree.

A hybrid approach is possible: calculate in the fastest possible way, e.g. tree, but then have a slower calculation detecting overflow. CON: extra power.

= Floating Point Dot Product =

Floating point dot product instructions have the same issues of order dependence as described for integer dot product with overflow detection - with the additional issue of rounding. Even in the absence of overflow,
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
may give different answers.

As before, some instruction sets define particular orders of evaluation, while some leave it undefined. Some also define an "infinite intermediate precision" form of inner product, where the sums are formed without rounding, and only rounded (and overflow detected) at the end.

[[Superaccumulators]] are a more general form of such arithmetic.

In the absence of such special support, floating point dot product instructions
with linear, e.g. leftmost, associativity are little faster than a succession of [[FMA]] or [[FMAC]] instructions.

= Alternative Inner Product Support =

Alternative instructions that can be used to synthesize dot products and other
possibly order dependent [[vector reductions]]
may be motivated by the issues discussed above.

Pairwise add could, of course, be done on a single vector operand, but then the
output would only be half-length.
Combining two vector operands in this manner produces a full-length vector output,
which is convenient.

Wiring-wise, if the A and B vector operands are interleaved
to allow [[vertical add]], this only requires wires sufficient to
route A1 to A0, and vece versa.

E.g. Intel SSE3's PMADDUBSW.
The putative problem of having only half as many vector elements in the PMADDUBSW output is
handled by doubling the precision,
taking 8 bits as input and producing a packed 16 bit vector as output.
Unfortunately, that technique does not work so well for floating point.

= Vector Parallel versus Vector Sequential =

Defining dot product with leftmost associativity
for purposes of overflow and rounding,
is well suited to a [[vector sequential]] implementation.

[[Vector parallel]] implementations are better suited to horizontal adds,
and other parallel orders of summation.

Modern instruction sets such as SSE tend to have [[vector parallel]] implementations.

It is possible to mix vector parallel and vector sequential,
although a latency penalty is paid when transitioning from vector sequential to vector parallel.

Thus we begin sliding down [[The Slippery Slope of Vector Instruction Sets]]:
there are so many different possible definitions of the vector instructions,
with different evaluation orders,
suited to different microarchitectures.
It is easy to define the most basic vector instructions.
Completing the coverage of vector instructions is more challenging.