Matrix operations on large datasets

From:

Ed

Subject:

Matrix operations on large datasets

Date:

Fri, 30 May 2008 21:40:12 +0100

(To avoid polluting the standard thread I copied this part across)
> 2008/5/30 Ben Pfaff <address@hidden>:
> But I'm not sure why you think that supporting large data sets
> has to be highly complex. A lot of it is just Computer Science
> 101 type stuff; for example, merge sorts for efficiently sorting
> large data sets (which we already implement). What do you have
> in mind?
I guess I was thinking of matrix operations in particular. I've been
reading the linear regression code as Jason suggested, and the best
way to implement linear algebra on a large dataset isn't immediately
clear to me.
My recollection of the maths in this area is a little hazy. The
obvious cases we might need seem to be:
vector <dot> vector:
is straightforward (stream in matching chunks of the two vectors).
working memory is 2* chunk size + 1 (for the total)
matrix x vector
the matrix could be wide or high (we assume its very nonsquare since
otherwise no algorithm can probably run in useful time anyway). All
this basically changes is the order we stream things in. If wide, we
stream across the matrix, calculating all n result elements
simultaneously. If high, we stream down the matrix, calculating each
of the n result elements one after another and saving the result to an
output stream.
memory usage:
wide: n totals + 2n chunks.
high: 1 total + m locations (much cheaper I guess)
matrix x matrix
In the case (n x m) x (m x p) where n,p <<< m, can we build the entire
result matrix as we stream the inputs across m?
The other cases I'm not thinking so clearly about...
Is there an existing library that would do this for us?
This is the sort of thing that seems complicated to me anyway.
Ed