I have published a first article about how to build a quaternion from two arbitrary direction vectors that will transform one of the directions into the other. That article was deliberately omitting the special case when the two vectors were facing away from each other, which required special treatment. So I wrote a second article explaining how to quickly pick an orthogonal vector.

Combining the results from the two articles should be straightforward. Here is the resulting, final function:

Whichever way you look at the sphere of unit vectors, a continuous tangent field will always have a zero point:

Most of the maths operations we usually do (adding, multiplying, taking square roots, etc.) are continuous. Fortunately, we are not limited to them: we can use a simple if() construct to create a discontinuity that will save us.

A very good implementation

Here is my personal implementation of the orthogonal vector problem. It is very good. It has excellent numerical stability. It only does one test. It is short. It is beautiful. I really hope you consider using it.

Many implementations, such as Ogre3D’s, have additional tests or operations to perform this task, and use epsilons and normalised vectors to avoid singularities.

But our only test is actually enough: if |x|>|z|, it means the largest component in absolute value in v is either x or y. So we build a vector using x and y. Otherwise, the largest component in absolute value is either y or z. So we build a vector using y and z. That way, we ensure that the length of our returned vector never, ever comes near zero.

Whether some given code will cause inefficient branching is often unclear. In our case, it is very likely that the ternary operator will actually be branchless, with some help of the compiler and the hardware.

That said, how about we try to get rid of the ternary operator, just for fun?

Going branchless for fun

Let’s see. We had these two candidate vectors, w1 and w2, which worked almost always except when specific values of v caused them to be zero. And whatever the constant k we may pick, a vector of the form w1 + kw2 will eventually take the value zero for some value of v, too, because of the hairy ball theorem.

Now here comes the trick. Instead of using a constant k, we use a function f(x,y,z). This is what our w vector looks like:

From now I shall assume that v is a unit vector.

What can cause w to be zero, then?

One necessary condition is y = 0. When y ≠ 0 we can chose anything we want for f(x,y,z), it’ll always give us a good orthogonal vector. This restricts the problem to the y = 0 circle, giving us the useful equality x² + z² = 1.

The other condition is x = z*f(x,y,z). So if we manage to build a function f such that f(x,y,z) never equals x/z, we win.

Using x² + z² = 1 we can plot all the possible values for x/z as a function of x. It will show us, for a given x, the values that fcannot take:

The almost vertical slopes you see go to infinity upwards and downwards. As expected, this prevents us from using a continuous function: it would obviously cross one of these walls at some point.

I find it highly unlikely that this second version will perform better than the branching one. However, I haven’t put much thought into it and maybe someone will come up with a much better solution using the ideas presented here. Have fun!

In this article I would like to explain the thought process behind the derivation of a widely used formula: finding a quaternion representing the rotation between two 3D vectors. Nothing really new, but hopefully a few ideas could be reused at other times.

This is more or less the code used by the ​Ogre3D engine in OgreVector3.h, except they perform an additional normalisation step on the final result. This is mathematically useless, but due to numerical stability issues, it is probably safe to do so nonetheless.

This final normalisation step is actually an opportunity to simplify the code even further.

Improving on Ogre3D

We are down to two square roots and four divisions, plus quite a few mul/adds. Depending on the platform that we are running on, it is possible to simplify even further and improve performance. For instance, on many SIMD architectures, normalising a quaternion can be very fast.

This is the code we get if we multiply every component of the quaternion by 2.f * half_cos and let normalize() do the rest of the job:

Isn’t it beautiful, considering the sin(), cos() and acos() ridden mess we started with?

This algorithm can be found all over the Internet, but I do not know who first came up with it. Also, a lot of 3D engines (both publicly available and slightly more private) could benefit from it.

Update (06/01/2014)

In the comments below, Michael Norel provides the following improvement to the non-unit version. Since the values d = dot(u, v) and w = cross(u, v) are computed no matter what, the value sqlength(u) * sqlength(v) could be computed in a different way, *i.e.* d * d + sqlength(w). The following code does at least three multiplications less:

Some time ago I devised an original algorithm to convert from RGB to HSV using very few CPU instructions and I wrote a small article about it.

When looking for a GLSL or HLSL conversion routine, I have found implementations of my own algorithm. However they were almost all straightforward, failing to take full advantage of the GPU’s advanced swizzling features.

Four min/max operations are performed to find rgb_max and rgb_min; however, sorting three values can be done with only 3 comparisons. This is not necessarily problematic because min/max could be wired in an efficient way depending on the CPU.

Two additional tests are performed to compare r and g to rgb_max; if rgb_max and rgb_min were computed using tests, this is a waste of time to compare them again.

Adding 6.f to the final hue value only has a 16.6% chance of happening.

The actual hue calculation depends on how r, g, and b are ordered:

But let’s rewrite this in terms of x, y and z, where x is the largest of (r,g,b), z is the smallest of the three, and y is inbetween:

There are a lot of similarities here. We can push it even further, using the fact that x ≥ z and y ≥ z by definition:

That’s actually the same calculation! Only the hue offset K changes. The idea now is the following:

You can check for yourself that the values for K explicited above are properly generated by that function. There were many other ways to sort (r,g,b) but this specific one lets us do one final optimisation.

We notice that the last swap effectively changes the sign of Kand the sign of g - b. Since both are then added and passed to fabs(), the sign reversal can actually be omitted.

That’s 2 tests and 1 std::min call instead of the previous 3 tests and 4 std::min/max calls. We really should see some kind of performance gain here.

And as expected, benchmarks indicate a performance increase of 25 to 40 % with a great variety of CPUs, compilers and compiler flags. The following graph (average nanoseconds per conversion) is on a Core i7-2600K CPU, using g++ 4.7.2 with -O3 -ffast-math: