I am working on a project that requires the manipulation of enormous matrices, particularly pyramidal summation for a copula calculation.

In short, I need to keep track of a relatively small number of values (usually a value of 1, and in rare cases more than 1) in a sea of zeros in the matrix (multidimensional array).

A sparse array allows the user to store a small number of values and assume all undefined records to be a preset value. Since it is not physically possible to store all values in memory (greater in number than the number of particles in the universe :p ), I need to store only the few non-zero elements. This could be several million entries; I currently work on a system that uses a binary search tree (b-tree) to store entries.

Does anyone know of a better system?

EDIT: Speed is a huge priority.

EDIT 2 : I like that solution. How would I go about dynamically choosing the number of variables in the class at runtime? [edit by MH: good question, updated in the answer]

What about the performance of getting an element range from this, or of checking whether a range is fully in the array?
–
aloneguid Mar 20 '12 at 11:18


The implementation of operator< is incorrect. Consider Triple{1,2,3} and Triple{3,2,1}: neither will compare less than the other. A correct implementation would check x, then y, then z sequentially instead of all at once.
–
Whanhee Jun 27 '14 at 15:14

Since it hadn't been fixed for an extended time, I took the liberty of replacing it with a correct implementation.
–
celtschk May 6 at 7:34
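For reference, the lexicographic comparison the comments describe might look like this; the `Triple` struct here is a minimal stand-in for whatever key type the answer uses:

```cpp
#include <tuple>

// Minimal stand-in for the answer's key type.
struct Triple { int x, y, z; };

// Strict lexicographic ordering: compare x first, then y, then z.
// std::tie builds a tuple of references, whose operator< does exactly that.
bool operator<(const Triple& a, const Triple& b) {
    return std::tie(a.x, a.y, a.z) < std::tie(b.x, b.y, b.z);
}
```

With this ordering, Triple{1,2,3} compares less than Triple{3,2,1}, so std::map can distinguish the two keys.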

Just a piece of advice: the method of using strings as indices is actually very slow. A much more efficient but otherwise equivalent solution is to use vectors/arrays. There's absolutely no need to write the indices into a string.
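A sketch of the vector-keyed alternative: `std::vector<int>` already has a lexicographic `operator<`, so it works directly as a map key, and the number of dimensions can be chosen at runtime (the `get` helper is my own illustration):

```cpp
#include <map>
#include <vector>

// The index is just a vector of coordinates; its length (the number of
// dimensions) can be decided at runtime.
using Index = std::vector<int>;
using SparseArray = std::map<Index, double>;

// Look up an entry, returning the implicit default (zero) when absent.
double get(const SparseArray& a, const Index& i) {
    auto it = a.find(i);
    return it == a.end() ? 0.0 : it->second;
}
```

Usage: `a[{1,2,3}] = 5.0;` stores one entry; any index never written reads back as 0.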

However, using a map isn't actually very efficient, because it is implemented as a balanced binary search tree. A much better-performing data structure in this case would be a (randomized) hash table.
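One hedged sketch of the hash-table approach, for the three-dimensional case: pack the three indices into a single 64-bit integer so `std::unordered_map` can hash it directly. The `Sparse3D` class and 21-bits-per-index packing are my own illustration, not from the original answer:

```cpp
#include <cstdint>
#include <unordered_map>

// Pack three indices into one 64-bit key; assumes each index < 2^21.
inline std::uint64_t make_key(std::uint64_t i, std::uint64_t j, std::uint64_t k) {
    return (i << 42) | (j << 21) | k;
}

class Sparse3D {
    std::unordered_map<std::uint64_t, double> data_;  // O(1) expected lookup
public:
    void set(std::uint64_t i, std::uint64_t j, std::uint64_t k, double v) {
        data_[make_key(i, j, k)] = v;
    }
    double get(std::uint64_t i, std::uint64_t j, std::uint64_t k) const {
        auto it = data_.find(make_key(i, j, k));
        return it == data_.end() ? 0.0 : it->second;  // implicit zero default
    }
};
```

The trade-off versus the tree-based map is the usual one: expected O(1) lookups, but no ordered iteration over the stored indices.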

You want SparseLib++ (http://math.nist.gov/sparselib++/) or Pardiso (http://www.pardiso-project.org/manual/index.html). Pardiso is C, so you'll have to write some wrapper C++ code. High-performance matrix libraries are one of those things that seem simple to implement but aren't, so it's best to go with something created by teams of scientists. Also, look into "compressed column storage" (using arrays) for one of the most efficient ways to store large matrices in memory.
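To illustrate what compressed column storage looks like (this is only a sketch of the layout, not a substitute for the libraries above; the `CCSMatrix` name is mine):

```cpp
#include <vector>

// Compressed column storage (CCS/CSC): one entry per non-zero, stored
// column by column, plus a pointer array delimiting each column.
struct CCSMatrix {
    std::vector<double> values;   // non-zero values, column by column
    std::vector<int>    row_idx;  // row index of each stored value
    std::vector<int>    col_ptr;  // col_ptr[c]..col_ptr[c+1] delimit column c

    double at(int r, int c) const {
        for (int k = col_ptr[c]; k < col_ptr[c + 1]; ++k)
            if (row_idx[k] == r) return values[k];
        return 0.0;  // entries not stored are the implicit zero
    }
};
```

For example, the 3x2 matrix {{1,0},{0,2},{3,0}} is stored as `values = {1,3,2}`, `row_idx = {0,2,1}`, `col_ptr = {0,2,3}`: three doubles and six ints instead of six doubles, with the savings growing with sparsity.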

LAPACK is a library for dense matrices. The standard BLAS is also for dense matrices. There is a Sparse BLAS package (through NIST), but it is different from the standard BLAS.
–
codehippo Aug 8 '09 at 12:46

Since only values at [a][b][c]...[w][x][y][z] are of consequence, we store only the indices themselves, not the value 1, which appears just about everywhere, is always the same, and carries nothing worth hashing. Noting that the curse of dimensionality is present, I suggest going with some established tool from NIST or Boost, or at least reading their sources, to circumvent needless blunders.
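A minimal sketch of this store-only-the-indices idea, assuming almost all non-zeros are exactly 1 (the `OnesSparseArray` name and split-container layout are my own illustration):

```cpp
#include <map>
#include <set>
#include <vector>

using Index = std::vector<int>;

// The 1-entries need no stored value: their presence in the set IS the
// value. The rare non-zero, non-one values live in a separate map.
struct OnesSparseArray {
    std::set<Index> ones;            // indices whose value is exactly 1
    std::map<Index, double> others;  // the few values other than 0 and 1

    double get(const Index& i) const {
        if (ones.count(i)) return 1.0;
        auto it = others.find(i);
        return it == others.end() ? 0.0 : it->second;
    }
};
```

Since the vast majority of entries are 1, dropping their stored values roughly halves the memory per entry compared with a plain index-to-value map.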

If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a map or B-tree with a uni-valued root is probably not practical. We can store only the indices themselves for all 1-values, hashed if ordering (sensibility for presentation) can be subordinated to reducing run time. Since non-zero values other than one are few, an obvious candidate for those is whatever data structure you can readily find and understand (write code that you can understand). If the data set is truly vast, universe-sized, I suggest some sort of sliding window that manages file/disk/persistent I/O yourself, moving portions of the data into scope as need be. If you are under commitment to provide an actual solution to a working group, failure to do so leaves you at the mercy of consumer-grade operating systems whose sole goal is taking your lunch away from you.