Hi all,
I'm an Erlang newb and I'm contemplating building a system where various
documents need to be aggregated. A document is basically key/value pairs that
hold different elements of information. My question revolves around which
Erlang data structures I should use, and whether I should use Mnesia or a more
traditional RDBMS like PostgreSQL.
Each document might be more or less complete than the others in the
collection (more or fewer key/value pairs present), documents change over
time, and they can be hierarchical in nature. These requirements lead me
toward a list or tuple or some such, and away from an RDBMS.
1. Erlang / Mnesia pseudo-code:
-record(doc, {
    location,
    week,           %% or month, or some other time dimension
    doc_data = []   %% [{key1, val1}, {key2, val2}, ..., {keyN, valN}]
}).
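To make option 1 a bit more concrete, here is a minimal sketch of how such a record might be built and queried. The module name doc_sketch, the helper names, and the sample keys (sales, staff) are placeholders I made up for illustration; nested proplists stand in for the "hierarchical" part, and a missing key simply comes back as undefined.

```erlang
-module(doc_sketch).
-export([new/3, get_value/2, example/0]).

-record(doc, {
    location,
    week,
    doc_data = []   %% proplist; values may themselves be proplists
}).

%% Build a doc from a location, a week and a list of {Key, Value} pairs.
new(Location, Week, Pairs) ->
    #doc{location = Location, week = Week, doc_data = Pairs}.

%% Follow a path of keys into the (possibly nested) proplist.
%% Returns undefined as soon as a step of the path is missing,
%% which is how an "incomplete" document shows up.
get_value(Path, #doc{doc_data = Data}) ->
    lookup(Path, Data).

lookup([], Value) ->
    Value;
lookup([Key | Rest], Data) when is_list(Data) ->
    case proplists:get_value(Key, Data) of
        undefined -> undefined;
        Value -> lookup(Rest, Value)
    end;
lookup(_Path, _NotAList) ->
    undefined.

%% Hypothetical sample data, only to show the calls.
example() ->
    Doc = new(store_42, 14,
              [{sales, 1200.0},
               {staff, [{full_time, 5}, {part_time, 3}]}]),
    {get_value([sales], Doc),
     get_value([staff, part_time], Doc),
     get_value([returns], Doc)}.
```

Calling doc_sketch:example() looks up one flat key, one nested key, and one key the document does not have.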
2. RDBMS schema (horizontal):
+----------+------+------+------+-----+------+
| location | week | val1 | val2 | ... | valN |
+----------+------+------+------+-----+------+
(difficult to 'change over time' and be 'hierarchical')
3. RDBMS schema (vertical):
+----------+------+-----+-----+
| location | week | key | val |
+----------+------+-----+-----+
or two tables:
+--------+----------+------+ +--------+-----+-------+
| doc_id | location | week | and | doc_id | key | value |
+--------+----------+------+ +--------+-----+-------+
(potential query inefficiency, difficult WHERE constraints for filtering out
missing values if need be, or for filtering out stores based on data values,
a feature that would require touching each doc!)
I'm guessing most of the aggregations that are needed are sums and averages.
Week and location will be used to narrow the set down, based on a time period
and characteristics of the locations. However, as mentioned above, sometimes
locations are eliminated because of data values in the doc, like, say, "show
me only the top 20% locations for the key1 data value." Maybe key1 == sales
or something.
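As a rough sketch of what I mean, assuming the docs are already in memory as a plain list of records, the aggregations could look something like this. The module name doc_agg, the function names, and the sample data are all made up for illustration, and ceil/1 assumes a reasonably recent OTP release. Docs that lack the key are simply skipped.

```erlang
-module(doc_agg).
-export([sum/2, avg/2, top_fraction/3, example/0]).

-record(doc, {location, week, doc_data = []}).

%% All numeric values stored under Key; docs lacking the key are skipped.
values(Key, Docs) ->
    [V || #doc{doc_data = D} <- Docs,
          V <- [proplists:get_value(Key, D)],
          is_number(V)].

sum(Key, Docs) ->
    lists:sum(values(Key, Docs)).

%% Average over the docs that actually have the key; undefined if none do.
avg(Key, Docs) ->
    case values(Key, Docs) of
        [] -> undefined;
        Vs -> lists:sum(Vs) / length(Vs)
    end.

%% Locations in the top Fraction (e.g. 0.2), ranked by their total for Key.
top_fraction(Key, Fraction, Docs) ->
    Totals = lists:foldl(
               fun(#doc{location = L, doc_data = D}, Acc) ->
                       case proplists:get_value(Key, D) of
                           V when is_number(V) ->
                               orddict:update_counter(L, V, Acc);
                           _ ->
                               Acc
                       end
               end, orddict:new(), Docs),
    Ranked = lists:reverse(lists:keysort(2, Totals)),
    N = ceil(Fraction * length(Ranked)),
    [L || {L, _Total} <- lists:sublist(Ranked, N)].

%% Hypothetical sample data: three locations, one of them without sales.
example() ->
    Docs = [#doc{location = a, week = 1, doc_data = [{sales, 10}]},
            #doc{location = a, week = 2, doc_data = [{sales, 30}]},
            #doc{location = b, week = 1, doc_data = [{sales, 90}]},
            #doc{location = c, week = 1, doc_data = [{visits, 7}]}],
    %% Narrow the set down by week before aggregating.
    Week1 = [D || D = #doc{week = 1} <- Docs],
    {sum(sales, Docs), avg(sales, Week1), top_fraction(sales, 0.5, Docs)}.
```

Note that every one of these walks the whole list, which is exactly the "touching each doc" cost I'm worried about once the population gets large.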
Questions:
----------
Which one should I use? Or are there alternative structures in Erlang that I
haven't listed?
How would my decision change as N grew? I'm not sure what the overall
population of documents will be, but you gotta dare to dream that the world
will eat this up en masse :) Millions or billions of docs would be cool. I'm
aware of limits in Mnesia tables, but frankly, for performance, I'd be
partitioning the RDBMS tables as I would Mnesia ones.
Do Mnesia's in-memory, distributed, fault-tolerant nature and its native
Erlang data structures outweigh the RDBMS's more rigid structure but long
history of optimization?
Would the Erlang / Mnesia approach, plus a mapreduce type of system spread
across many boxes, help tilt the scales away from the RDBMS?
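To show the kind of thing I have in mind, here is a toy sketch that runs a map function over each partition in its own process and folds the partial results together. Everything here (the module name doc_mr, the function names, the sample partitions) is made up for illustration, and it only spawns local processes; I'm assuming that spreading it over several boxes would mean spawning on remote nodes instead, plus error handling that this sketch leaves out.

```erlang
-module(doc_mr).
-export([map_reduce/4, example/0]).

%% Run Map over each partition in its own process, then fold the partial
%% results together with Reduce, starting from Acc0.
%% Partitions is a list of lists.
map_reduce(Map, Reduce, Acc0, Partitions) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Map(P)} end),
                Ref
            end || P <- Partitions],
    %% Collect one reply per worker, matching on its unique reference.
    lists:foldl(fun(Ref, Acc) ->
                        receive
                            {Ref, Result} -> Reduce(Result, Acc)
                        end
                end, Acc0, Refs).

%% Hypothetical example: sum the sales values held in three partitions.
example() ->
    Partitions = [[{sales, 10}, {sales, 20}],
                  [{sales, 5}],
                  []],
    Map = fun(P) -> lists:sum([V || {sales, V} <- P]) end,
    Reduce = fun(Partial, Acc) -> Partial + Acc end,
    map_reduce(Map, Reduce, 0, Partitions).
```

Sums and averages decompose nicely this way (carry a sum and a count per partition), but a "top 20% of locations" query would still need a final ranking step over the merged totals.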
Sorry for the long post. It's kind of an important decision ;)
Cheers,
Brad