So I can have one PagedIndex CF that holds a row for each data file I am
processing.
The columns for that row (in my example) would be X columns, and I can make
those columns' values be 100 strings that represent row keys in another
PagedData CF.
Each row in that PagedData CF would have 10,000 columns whose values hold my
data, which I can loop through, parallelize, and scale on, so I can do this
100 times simultaneously.
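A minimal sketch of that two-CF layout, modeled with plain Python dicts (the
names, the page size of 5, and the fan-out of 3 are made up here for
illustration; the text's real numbers are 10,000 columns per PagedData row and
100 row keys per PagedIndex column, and a real cluster would go through a
Cassandra client rather than dicts):

```python
PAGE_SIZE = 5   # data columns per PagedData row (10,000 in the text)
FAN_OUT = 3     # PagedData row keys per PagedIndex column (100 in the text)

def load_file(file_key, records):
    """Split records into PagedData rows of PAGE_SIZE columns, and index
    groups of FAN_OUT row keys under one PagedIndex column."""
    paged_data = {}  # row key -> {column name: value}
    row_keys = []
    for start in range(0, len(records), PAGE_SIZE):
        row_key = f"{file_key}:page:{start // PAGE_SIZE}"
        paged_data[row_key] = {
            f"col{i}": v
            for i, v in enumerate(records[start:start + PAGE_SIZE])
        }
        row_keys.append(row_key)
    # PagedIndex: one row per file; each column value is a batch of
    # FAN_OUT PagedData row keys.
    paged_index = {
        file_key: {
            f"batch{b // FAN_OUT}": row_keys[b:b + FAN_OUT]
            for b in range(0, len(row_keys), FAN_OUT)
        }
    }
    return paged_index, paged_data

index, data = load_file("file1", [f"rec{i}" for i in range(23)])
# 23 records -> 5 PagedData rows (pages of 5), indexed in 2 batches of <= 3.
```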
This is really awesome because if I have 10 files, each with a billion rows,
and I push them into this pattern, I can scale quite nicely, provided 10,000
is my magic number of columns to page. For 10,000,000,000 rows I would have,
in my first PagedIndex CF, 10,000 columns (each representing 100 PagedData
rows that hold data). For each of the 100 rows under each column I can then
pull that row, pulling out 10,000 pieces of data to process, 100 at a time,
on different servers.
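The fan-out step above can be sketched like this, again with dicts standing in
for the cluster and with toy sizes (4 PagedData rows of 6 columns, summing the
columns as a stand-in for real processing). Because each worker pulls a
different PagedData row, in Cassandra those reads would land on different
replicas rather than all hammering one big row:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical PagedData rows: 4 rows of 6 integer columns each.
paged_data = {
    f"file1:page:{p}": {f"col{i}": p * 6 + i for i in range(6)}
    for p in range(4)
}
# The batch of row keys one PagedIndex column value would hand to a worker group.
batch = sorted(paged_data)

def process_row(row_key):
    # One worker: fetch a whole PagedData row and reduce its columns.
    return sum(paged_data[row_key].values())

# Each worker handles one row; rows are processed simultaneously.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(process_row, batch))

grand_total = sum(totals)
```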
got it, thanks! awesome!
On Sun, Jun 5, 2011 at 4:36 PM, Jonathan Ellis wrote:
> If you need to parallelize (and scale) you need to distribute across
> multiple rows. One Big Row means all your 100 workers are hammering
> the same 3 (for instance) replicas at the same time.
>
> On Sun, Jun 5, 2011 at 1:43 PM, Joseph Stein wrote:
> > What are the best practices here to page and slice columns from a row?
> > So let's say I have 1,000,000 columns in a row.
> > I read the row but want to have one thread read columns 0 - 9999, a second
> > thread (an actor in my case) read 10000 - 19999 ... and so on, so I can
> > have 100 workers each processing 10,000 columns for each of my rows.
> > If there is no API for this, then is it something I should put a composite
> > key on, and populate the column names with a counter prefix
> > 0000000:myoriginalcolumnnameX
> > 0000001:myoriginalcolumnnameY
> > 0000002:myoriginalcolumnnameZ
> > Going the composite key route and doing a start/end predicate would work,
> > but then it kind of makes the insertion/load of this go through a single
> > synchronized point to generate the column names... I am not opposed to
> > this, but I would prefer that neither the load of my data nor the
> > processing of my data be bound by any single lock (even if distributed).
> > Thanks!!!!
> > /*
> > Joe Stein
> > http://www.linkedin.com/in/charmalloc
> > Twitter: @allthingshadoop
> > */
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
--
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/