Numerical problem: efficient Euclidean distance with very large numbers

Hi all,

I'm looking for an efficient solution to the following problem:

Let V be a set containing m vectors, V = {v1, v2, ..., vm}. Given an input vector vi, I need to find the vector from V which has the smallest Euclidean distance to vi (i.e., find the best matching unit).
Normally this is not an issue. However, every element of V's vectors and of the input vector vi always has a value from the open interval (0,1).

Example of a 2D vector:
v1 = (0.0123123423423 , 0.00023567887)

Assume that every number has several thousand or even a million digits. Practically speaking, if you implement the Euclidean distance for numbers of this kind, you will need special high-precision libraries, where every mathematical operation (add, multiply, sqrt) runs many times slower than with primitives such as double.

Recently I had the idea to tokenize a long number into smaller pieces and execute the Euclidean distance only on these pieces.
For instance, let's assume we are in 1D (for simplicity) and we have the following two vectors v1, v2 and the input vector vi:

v1 = 0.001222345566
v2 = 0.123456649900
vi = 0.1923343233232

Let's say we are tokenizing the numbers into pieces of length 5; then the first token (t1) of each vector would look like this:

t1(v1) = 00122
t1(v2) = 12345
t1(vi) = 19233

Here we see that the first 5 digits of v2 are closer to vi than v1's. Thus, whatever comes after these 5 digits, v2 will always be closer to vi, and so v2 is the best matching unit.
If the first tokens of v1 and v2 had the same (or very close) distance to vi, then the second tokens, containing the next 5 digits, would be compared.
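As a sketch, the 1-D comparison described above might look like the following Python (the helper name `best_match_1d` is hypothetical; Python is chosen only because its strings and integers keep the digit handling short). One caveat: when two tokens are tied or differ by only 1, a borrow from later digits can still flip the answer, so this sketch only decides early when the token gap is at least 2:

```python
def best_match_1d(vi: str, v1: str, v2: str, chunk: int = 5) -> str:
    """Heuristic token-wise comparison (1-D case).

    All arguments are decimal strings in (0, 1). Returns whichever
    of v1 / v2 is judged closer to vi."""
    frac = lambda s: s.split(".", 1)[1]          # drop the "0." prefix
    d_i, d_1, d_2 = frac(vi), frac(v1), frac(v2)
    n = max(len(d_i), len(d_1), len(d_2))
    d_i, d_1, d_2 = (d.ljust(n, "0") for d in (d_i, d_1, d_2))

    for start in range(0, n, chunk):
        t_i = int(d_i[start:start + chunk])
        dist1 = abs(int(d_1[start:start + chunk]) - t_i)
        dist2 = abs(int(d_2[start:start + chunk]) - t_i)
        if abs(dist1 - dist2) >= 2:              # gap too big to flip later
            return v1 if dist1 < dist2 else v2
        # tokens (almost) tied: fall through to the next chunk
    return v1                                    # complete tie
```

For the three example numbers above, the first tokens already differ by far more than 1, so v2 is returned after a single chunk.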

The questions:

Does this approach make sense? (By "approach" I mean tokenizing the numbers and comparing them part by part.)
Am I overlooking anything fundamental?
How can this idea be improved?

Edit @ 18.12.07, 17:56:
For new readers and future responders: please accept the fact that I'm using huge numbers; this is simply a requirement and can't be changed.

thanks in advance,
Andreas

Last edited by sirandreus; Dec 18th 2007 at 08:57 AM.
Reason: some corrections again

Note that the solution to this problem does not need to be exact, so approximations are welcome as well.

thanks in advance,
Andreas

If it does not need to be "exact", then why not just truncate all the numbers to 10 significant digits and work with those in standard double-precision floating point?
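If that route were taken, a minimal sketch using Python's standard `decimal` module might look like this (the helper name `to_double` is hypothetical; the thread names no language):

```python
from decimal import Decimal, Context, ROUND_DOWN

def to_double(s: str, sig: int = 10) -> float:
    # Keep only the first `sig` significant digits of the decimal
    # string, then fall back to an ordinary double.
    return float(Context(prec=sig, rounding=ROUND_DOWN).plus(Decimal(s)))
```

`Context.plus` applies the context's precision and rounding to its argument, so the truncation happens in exact decimal arithmetic before the lossy conversion to float.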

The representation of the numbers must be in high precision, but, practically speaking, not "exact".

I can't just consider the first 10 digits; see this counterexample:

0.00000000000002354534534
0.00000000000001123545455

You cannot compare these two numbers if you cut away everything but the first 10 positions.

In my original idea (see the first post), you would first compare the first 10 positions, which are equal. Because they are equal, you would compare the next 10 positions (assuming you are moving in steps of 10).
There you would compare 0002354534 with 0001123545.
Working with a fixed digit precision, the probability of two distances being indistinguishable is small (with a bit of scaling and to a hand-waving approximation), so you will need a lot of vectors before the chance of encountering a problem becomes significant.

In short: taking (0,1)^n as an input space and k vectors v1, ..., vk in that space, these k vectors are mapped uniquely onto one vector in (0,1)^n.
This works by interleaving the decimal expansions of the vectors, and thus a very large vector can be created.
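A minimal sketch of the interleaving step for a single component, assuming the numbers arrive as decimal strings in (0,1) (the helper name `interleave` is hypothetical):

```python
def interleave(nums, digits):
    # Interleave the first `digits` fractional digits of each decimal
    # string in `nums` into a single decimal string: position i of the
    # result cycles through the i-th digit of every input in turn.
    fracs = [s.split(".", 1)[1].ljust(digits, "0")[:digits] for s in nums]
    merged = "".join(f[i] for i in range(digits) for f in fracs)
    return "0." + merged
```

For example, interleaving "0.12" and "0.34" with 2 digits yields "0.1324". For a full vector in (0,1)^n, this would be applied componentwise.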

Sorry CaptainBlack, but I can't figure out how your statement about indistinguishable distances is related to my problem.
As I understand it, you are talking about the probability that two vectors have the same digits (up to a length of 13). In fact, this probability is not relevant, because I'm looking for the best matching unit (the vector from the set V which is the closest, in terms of Euclidean distance, to the input vector v_i).


Kind regards,
Andreas

You are doomed to failure unless you recast your problem.

Almost all reals in (0,1) have non-terminating decimal expansions, which leaves you with infinitely long calculations.

The interleaving of the decimal expansions is an "in-principle" process; in practice it is impossible to do exactly.


RonL

Correct, but not an issue for me, because I already have the numbers in a finite representation before I do the interleaving.

I don't care what these numbers should exactly be. I only see the numbers and assume that they are accurate. From a practical and numerical perspective there is nothing else you can do. I mean, if your sensor gives you 0.33333333335, that's it; you can't ask yourself whether it could mean 0.3333333333333...33333...


cheers,
Andreas

1. Check your sensor's real accuracy; 24-bit ADCs don't give 24 bits of accuracy, maybe 20 bits. You may not have as much real precision as you think.

2. Work with int64s (or, if you have them, int128s).

3. Look at getting an arbitrary-precision arithmetic package/class (google for one for the programming language you intend to use).
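As an illustration of points 2 and 3, here is a sketch in Python, whose built-in integers are already arbitrary precision (helper names are hypothetical): scale each coordinate in (0,1) to an integer with p fractional digits, then rank candidates by squared distance. The square root can be skipped entirely, since sqrt is monotone and only the ordering matters for finding the best matching unit.

```python
def as_ints(vec, p):
    # Scale each coordinate in (0,1) to an integer carrying its first
    # p fractional digits: "0.123" -> 1230... (p digits total).
    return [int(s.split(".", 1)[1].ljust(p, "0")[:p]) for s in vec]

def sq_dist(u, v, p):
    # Exact squared Euclidean distance in integer arithmetic.
    return sum((a - b) ** 2 for a, b in zip(as_ints(u, p), as_ints(v, p)))

def best_matching_unit(V, vi, p):
    # Rank by squared distance; no sqrt and no rounding error involved.
    return min(V, key=lambda v: sq_dist(v, vi, p))
```

With the 1-D example numbers from the first post and p = 13, this picks v2 = 0.123456649900 as the best matching unit, agreeing with the hand analysis there.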