I am evaluating SciDB for use in a spectrum database project where we will store spectrograms (power spectrum maps) gathered from sensors and later query them to determine spectrum usage patterns.

i.e. For a given time and frequency I want to record the power and the attributes of the sensor that made the measurement, and later retrieve it for analysis.
Note that sensor type and location are repeated: many readings will carry the same string, which only changes when the sensor is changed. It would be extremely convenient if I could store all of these attributes together, so that I have a single sparse array where each entry contains the attributes above.

My question is: if I repeat these strings, will SciDB take care of managing storage so that I don’t blow up memory usage (i.e. will SciDB detect that I am storing the same string again and increment a reference count internally, etc.)? If so, I can rely on SciDB to do storage management for me; if not, I’ll have to manage repeated values myself.

What are the native types for Time and Frequency? Asking because, in dealing with the world as arrays, the closer you can get to the underlying 64-bit integers, the better. For things like Time (for example) it’s common practice to write one function that extracts seconds (or fractions thereof) and converts them into seconds from some epoch, and a second function that takes the 64-bit integer and converts it back into the original time. What about “Location”? Is it an identifying string? Or is there (for example) an [ X, Y ] component? (Is this data geographic? Or are you just looking at sensor series in machinery, for example?)
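To make the epoch-conversion idea concrete, here is a minimal Python sketch. The function names and the millisecond resolution are my own choices, not anything SciDB prescribes: a pair of functions that map a timestamp to a 64-bit integer dimension value and back.

```python
from datetime import datetime, timezone

def time_to_ticks(ts: datetime, ticks_per_second: int = 1000) -> int:
    """Map a timestamp to an integer dimension value: ticks since the Unix epoch."""
    return int(ts.timestamp() * ticks_per_second)

def ticks_to_time(ticks: int, ticks_per_second: int = 1000) -> datetime:
    """Invert the mapping, recovering the original timestamp."""
    return datetime.fromtimestamp(ticks / ticks_per_second, tz=timezone.utc)

t = datetime(2014, 6, 1, 12, 30, 0, tzinfo=timezone.utc)
ticks = time_to_ticks(t)          # a plain 64-bit integer, usable as a dimension
assert ticks_to_time(ticks) == t  # round-trips exactly at this resolution
```

The same trick works for Frequency: pick a base resolution (say, 1 kHz), divide, and store the resulting integer as the dimension value.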

I’m a bit puzzled by something. The dimensions of an array are similar to the key of a SQL table: each combination of values for the dimensions identifies at most one cell in the array. Yet … what happens here when two different sensor types, or two different locations, produce the same frequency at the same time? You also say that “many readings will have the same string except when the sensor is changed.”

Can I suggest that the independent variables you’re dealing with here are “Location”, and “Time”, with the others being dependent variables? That is …

Can I also assume that the Sensor Type does not vary (much) by Location, except when you change the sensor? (See below for more commentary.)

A central consideration in designing this kind of schema is the nature of the workload you’re going to be applying to it. What questions do you want to ask of this data? Variance of power and frequency by time? Distribution of energy ( power * frequency ) over the range of sensor locations? Asking because, once you’ve gotten the schema down, the next thing you’re going to ask is about queries.

OK - on to some hopefully helpful explanations.

SciDB adopts a columnar storage system. Our first act is to break an array’s list of attributes up into separate data storage, one per attribute. The purpose of this strategy is two-fold. First, it means that for very “wide” data sets (with lots and lots of columns/attributes), queries which address only a small subset of the attributes are executed without bringing in all of the superfluous attributes’ data. Second, it means that we can better exploit compression and other space-reduction techniques on per-attribute data (less entropy = better compression).

So … take your problematic “sensor_type” string. Given that the sensor_type is pretty much always determined by the Location, and only changes occasionally over time, SciDB is able to use run-length encoding to reduce the space used to hold this string to next to nothing.
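To illustrate the intuition, here is a toy sketch of run-length encoding in Python (my own illustration, not SciDB’s actual on-disk format): a long run of identical sensor_type strings collapses to a handful of (value, run length) pairs.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

# A year of hourly readings from one location, with one sensor swap:
# 8,760 strings shrink to just two runs. (Sensor names are hypothetical.)
readings = ["ACME-RF-1000"] * 5000 + ["ACME-RF-2000"] * 3760
assert rle_encode(readings) == [("ACME-RF-1000", 5000), ("ACME-RF-2000", 3760)]
```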

So the answer to your question “if I repeat these strings, will scidb take care of managing storage so that I don’t blow up memory usage” is “yes”.

There’s a major question we need to answer about the chunk sizes (per-dimension chunk lengths) to use. I am going out on a limb here to suggest that your data might be very “sparse” (lots of points in time at which no data was submitted for a particular sensor location, or alternatively, lots of “spaces” in the frequency / time plane that are empty).

Search these forums for “chunk size” related questions. The high-order bit is that you want the combination of your per-dimension chunk lengths to be such that you get about 1,000,000 cells per chunk.
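The arithmetic is simple enough to sketch in a few lines of Python (the helper name and the density adjustment are my own illustration): multiply the per-dimension chunk lengths together, discount by the expected occupancy, and aim for roughly a million occupied cells.

```python
def cells_per_chunk(chunk_lengths, density=1.0):
    """Expected number of occupied cells in one chunk, given the fraction
    of the logical space that actually holds data."""
    cells = 1
    for n in chunk_lengths:
        cells *= n
    return cells * density

# Dense 2-D array: 1000 x 1000 chunks hit the ~1,000,000-cell target.
assert cells_per_chunk([1000, 1000]) == 1_000_000

# If only ~1% of the cells are occupied, scale the chunk lengths up
# (here 10x per dimension) so each chunk still holds ~1,000,000 values.
assert cells_per_chunk([10_000, 10_000], density=0.01) == 1_000_000
```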

Finally, like I said above, I’m not sure your schema design quite reflects what I suspect your data looks like, and how it’s organized. I’ve put together a little script that is my attempt to model your data and illustrate how it might be organized.

Thank you very much for such a detailed reply! You got me thinking again. However, I am confused about how to choose attributes vs. dimensions. Let’s say I pick location (i.e. latitude and longitude) as two dimensions. I can convert all dimensions to integers by simple multiplication. So then I would have an array defined as follows.

Clearly, here power will be repeated. I could have many readings with the same power value. But does that matter? If I wanted to find all readings at a given power, I can select on just the power index and I will get all such readings. So my question really is, in designing the array, what should I use as an attribute and what should I use as the dimension and does that choice affect performance?

BTW, just to introduce myself: I work at NIST in Gaithersburg, MD. We are working on a project that involves spectrum sensing at different locations to characterize spectrum usage. I am still at the stage of evaluating which database to use, but SciDB looks pretty compelling.

You certainly can store the data (initially) with power and frequency as dimensions, but what you’d produce would be a very, very sparse array. And one that’s probably highly skewed to boot. SciDB is OK with that. But there might be no conceptual advantage to doing things that way.

Back to my suggested array … this time slightly modified to include your lat/long/time three dimensions. . .

NOTE: I’ve deliberately kept the chunk lengths out of this declaration. Their precise values depend on the data. Also, you might consider using some overlapping chunks to permit the efficient calculation of things like moving windows.
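To show why overlap helps with moving windows, here is a toy Python sketch (my own illustration, not SciDB internals): each chunk carries a halo of neighbouring cells, so a centred window over the chunk’s own cells never has to reach into another chunk at query time.

```python
def chunk_with_overlap(data, chunk_len, overlap):
    """Split a 1-D series into chunks, each padded with up to `overlap`
    halo cells copied from its neighbours."""
    chunks = []
    for start in range(0, len(data), chunk_len):
        lo = max(0, start - overlap)
        hi = min(len(data), start + chunk_len + overlap)
        # (index of the chunk's first owned cell within the slice, the slice)
        chunks.append((start - lo, data[lo:hi]))
    return chunks

def moving_avg(data, chunk_len, radius):
    """Centred moving average computed chunk-by-chunk: because the halo is
    as wide as the window radius, no chunk reads a neighbour at query time."""
    out = []
    for offset, cells in chunk_with_overlap(data, chunk_len, overlap=radius):
        for i in range(min(chunk_len, len(cells) - offset)):
            j = offset + i
            window = cells[max(0, j - radius):j + radius + 1]
            out.append(sum(window) / len(window))
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
assert moving_avg(data, chunk_len=3, radius=1) == [1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5]
```

The same idea generalizes to two or three dimensions: an overlap of at least the window radius in each dimension keeps window calculations local to a chunk.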

In SciDB, it’s perfectly possible to “filter” the contents of the array by looking at the value in each cell.

When you do this kind of query, though, you don’t get any “indexing” on the power_value filter. But that might not be a problem. The search is conducted entirely in parallel, and so long as your filtering ranges aren’t that large, this is actually a better execution strategy than using any kind of index. And you can turn the data from this form into (say) frequency vs. power using queries. The following query takes the data from (my) AlternSpectroData array and computes the number of “events” that occur in each range of power_value and frequency values.
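For intuition about what such a query computes, here is a rough Python sketch (the helper name, the sample readings, and the bin sizes are my invention): bucket each reading by its power range and frequency range, then count events per bucket, much as a grouped aggregation over two binned dimensions would.

```python
from collections import Counter

def bin_counts(readings, power_bin, freq_bin):
    """Count events per (power range, frequency range) cell: the same shape
    of result a grouped aggregation over two binned dimensions produces."""
    counts = Counter()
    for power, freq in readings:
        counts[(int(power // power_bin), int(freq // freq_bin))] += 1
    return dict(counts)

# Three hypothetical readings: power in dBm, frequency in Hz.
readings = [(-51.2, 99.9e6), (-53.8, 100.1e6), (-47.0, 100.2e6)]
# 10 dB power bins and 1 MHz frequency bins.
assert bin_counts(readings, power_bin=10, freq_bin=1e6) == {
    (-6, 99): 1, (-6, 100): 1, (-5, 100): 1
}
```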