If you want to do range queries on the keys, you can use the
order-preserving partitioner (OPP). For example, using UTF-8
lexicographic keys, with bursts split across rows according to row size
limits:
Events: {
  "20100601.05.30.003": {
    "20100601.05.30.003":
    "20100601.05.30.007":
    ...
  }
}
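
Here is a rough sketch of what such a key range query amounts to. It is
plain Python with no Cassandra client involved: the sorted key list only
stands in for the key order that OPP maintains, and `rows`, `sorted_keys`
and `range_query` are names I made up for illustration. The event values
are invented; the key format follows the example above.

```python
import bisect

# Toy stand-in for an OPP-ordered key space. Row contents are made up.
rows = {
    "20100531.23.59.120": {"20100531.23.59.120": "event-a"},
    "20100601.05.30.003": {"20100601.05.30.003": "event-b",
                           "20100601.05.30.007": "event-c"},
    "20100601.17.02.450": {"20100601.17.02.450": "event-d"},
    "20100602.00.00.001": {"20100602.00.00.001": "event-e"},
}

# With OPP, keys are stored in sorted order; here we just sort them once.
sorted_keys = sorted(rows)


def range_query(start, end):
    """Return (key, row) pairs with start <= key < end, in key order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]


# All rows for 2010-06-01: every key for that day sorts at or after
# "20100601" and before "20100602".
day_rows = range_query("20100601", "20100602")
day_keys = [k for k, _ in day_rows]
```

The point is only that time-ordered key strings turn "everything on this
day" into a single contiguous slice of the key space.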
In a future version of Cassandra, you may be able to use the same
basic datatype for both keys and column names, since I believe keys
will be binary like everything else.
I'm not aware of specific performance improvements when using OPP
range queries on keys vs. iterating over known keys. I suspect (hope)
that the number of round trips to the server would be reduced, which
may be significant. Does anybody have decent benchmarks that show the
difference?
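
To make the "iterating over known keys" side concrete, here is a toy
in-memory model of the two-level layout from Ben's message below. It is
only a sketch: `get_row`, `events_for_day` and the `fetches` counter are
hypothetical names, the contents of "group124" are invented, and the
counter tallies logical row fetches, not measured network round trips.
It is not a benchmark.

```python
# Toy in-memory model of the two column families in Ben's message.
event_groups_per_day = {
    "20100601": {123456789: "group123", 123456790: "group124"},
}
event_groups = {
    "group123": {123456789: "value1", 123456799: "value2"},
    "group124": {123456790: "value3"},  # made-up contents for illustration
}

fetches = 0  # logical row fetches, not real network round trips


def get_row(column_family, key):
    """Fetch one row from the toy store and count the fetch."""
    global fetches
    fetches += 1
    return column_family.get(key, {})


def events_for_day(day):
    # Step 1: fetch the day's index row to learn which groups exist.
    index_row = get_row(event_groups_per_day, day)
    events = []
    # Step 2: fetch each group row, in order of the group's timestamp.
    for _, group_key in sorted(index_row.items()):
        group = get_row(event_groups, group_key)
        events.extend(sorted(group.items()))
    return events


events = events_for_day("20100601")
```

In this model a day with N groups costs 1 + N fetches. A client that
batches the per-group fetches would need fewer calls, so treat the count
as an upper bound on the idea, not as a measurement.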
On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning wrote:
> With a traffic pattern like that, you may be better off storing the
> events of each burst (I'll call it a group) in one or more keys and
> then storing these keys in the day key.
>
> EventGroupsPerDay: {
>   "20100601": {
>     // column name is the timestamp the group was received,
>     // column value is the group's key
>     123456789: "group123",
>     123456790: "group124"
>   }
> }
>
> EventGroups: {
>   "group123": {
>     123456789: "value1",
>     123456799: "value2"
>   }
> }
>
> If you think of Cassandra as a toolkit for building scalable indexes,
> it seems to make the modeling a bit easier. In this case, you're
> building an index by day to look up events that come in as groups. So,
> first you'd fetch the slice of columns for the day you're interested
> in to figure out which groups to look at, then you'd fetch the events
> in those groups.
>
> There are plenty of alternate ways to divide up the data among rows
> as well. For example, you could use hour keys instead of day keys.
>
> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn wrote:
>> Let's say you're logging events, and you have billions of events. What if
>> the events come in bursts, so within a day there are millions of events, but
>> they all come within microseconds of each other a few times a day? How do
>> you find the events that happened on a particular day if you can't store
>> them all in one row?
>>
>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook wrote:
>>>
>>> Either OPP by key, or within a row by column name. I'd suggest the latter.
>>> If you have structured data to stick under a column (named by the
>>> timestamp), then you can serialize and unserialize it yourself, or you
>>> can use a supercolumn. It's effectively the same thing. Cassandra
>>> only provides the super column support as a convenience layer as it is
>>> currently implemented. That may change in the future.
>>>
>>> You didn't make clear in your question why a standard column would be
>>> less suitable. I presumed you had layered structure within the
>>> timestamp, hence my response.
>>> How would you logically partition your dataset according to natural
>>> application boundaries? This will answer most of your question.
>>> If you have a dataset which can't be partitioned into a reasonable
>>> size row, then you may want to use OPP and key concatenation.
>>>
>>> What do you mean by giant?
>>>
>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn wrote:
>>> > How do I handle giant sets of ordered data, e.g. by timestamps, which I
>>> > want to access by range?
>>> >
>>> > I can't put all the data into a supercolumn, because it's loaded into
>>> > memory at once, and it's too much data.
>>> >
>>> > Am I forced to use an order-preserving partitioner? I don't want the
>>> > headache. Is there any other way?
>>> >