To rephrase the sentence "Or what APIs I should start with for my testing?": I
mean, "Which HDFS APIs should I start looking into for my testing?"
Thanks,
Xueling
On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu wrote:
> Hi Todd:
>
> After finishing some tasks, I am finally getting back to HDFS testing.
>
> One question for your last reply to this thread: Are there any code
> examples close to your second and third recommendations? Or what APIs I
> should start with for my testing?
>
> Thanks.
> Xueling
>
>
> On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote:
>
>> Hi Xueling,
>>
>> In that case, I would recommend the following:
>>
>> 1) Put all of your data on HDFS
>> 2) Write a MapReduce job that sorts the data by position of match
>> 3) As a second output of this job, you can write a "sparse index" -
>> basically a set of entries like this:
>>
>>   (first "position of match" in the block, byte offset of that record
>>   in the sorted file, number of entries following before the next
>>   index entry)
>>
>> where you're basically giving offsets into every 10K records or so. If
>> you index every 10K records, then 5 billion total will mean 500,000
>> index entries. Each index entry shouldn't be more than 20 bytes, so
>> 500,000 entries will be about 10MB. This is super easy to fit into
>> memory. (you could probably index every 100th record instead and end
>> up with about 1GB, which still fits in memory on a big machine)
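Steps 2-3 can be modeled outside Hadoop in a few lines. A sketch (Python for brevity; the record layout, with the position as the first whitespace-separated field, and the sampling interval are assumptions for illustration, not the real chip-sequencing format):

```python
# Sketch only: build a sparse index over already-sorted records.
# Assumed record layout: the first whitespace-separated field is the
# "position of match" (the real format has it mid-record). One index
# entry is emitted every `every` records, holding the position, the
# byte offset of the record's line, and how many records precede it.

def build_sparse_index(sorted_lines, every=10_000):
    index = []
    offset = 0
    for i, line in enumerate(sorted_lines):
        if i % every == 0:
            position = int(line.split()[0])
            index.append((position, offset, i))
        offset += len(line) + 1  # +1 for the newline separator
    return index
```

In a real job the reducer would emit these entries as it writes the sorted records, rather than making a second pass.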
>>
>> Then to satisfy your count-range query, you can simply scan your
>> in-memory sparse index. Some of the indexed blocks will be completely
>> included in the range, in which case you just add up the "number of
>> entries following" column. The start and finish block will be
>> partially covered, so you can use the file offset info to load that
>> file off HDFS, start reading at that offset, and finish the count.
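A minimal in-memory sketch of that scan (names are illustrative; the entry shape follows the description above, and `block_positions` is a stand-in for reading one block of records back from HDFS):

```python
import bisect

# Sketch only: count records whose "position of match" lies in [lo, hi]
# using an in-memory sparse index over sorted data. The entry shape
# (first_position, file_offset, n_following) and the block_positions
# callback are assumptions, not a real HDFS API.

def count_in_range(index, block_positions, lo, hi):
    total = 0
    for k, (first, offset, n_following) in enumerate(index):
        # The next block's first position bounds this block's positions.
        last = index[k + 1][0] if k + 1 < len(index) else float("inf")
        if first > hi or last < lo:
            continue                 # block entirely outside the range
        if lo <= first and last <= hi:
            total += n_following     # block entirely inside: no I/O needed
        else:
            # Boundary block: re-read its records and count exactly.
            ps = block_positions(offset)
            total += bisect.bisect_right(ps, hi) - bisect.bisect_left(ps, lo)
    return total
```

Only the one or two boundary blocks require touching HDFS, which is why the per-query time stays small.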
>>
>> Total time per query should be <100ms no problem.
>>
>> -Todd
>>
>> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu
>> wrote:
>> > Hi Todd:
>> >
>> > Thank you for your reply.
>> >
>> > The datasets won't be updated often, but queries against a data set
>> > are frequent, and the quicker the query the better. For example, we
>> > have tested a MySQL database (5 billion records randomly scattered
>> > across 24 tables), and the slowest query against the biggest table
>> > (400,000,000 records) takes around 12 minutes. So if any Hadoop
>> > product can speed up the search, that product is what we are looking
>> > for.
>> >
>> > Cheers,
>> > Xueling
>> >
>> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon wrote:
>> >
>> >> Hi Xueling,
>> >>
>> >> One important question that can really change the answer:
>> >>
>> >> How often does the dataset change? Can the changes be merged in bulk
>> >> every once in a while, or do you need to actually update them
>> >> randomly very often?
>> >>
>> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds,
>> >> 1 second, or 10ms?
>> >>
>> >> Thanks
>> >> -Todd
>> >>
>> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu
>> >> wrote:
>> >> > Hi there:
>> >> >
>> >> > I am researching Hadoop to see which of its products suits our
>> >> > need for quick queries against large data sets (billions of
>> >> > records per set).
>> >> >
>> >> > The queries will be performed against chip sequencing data. Each
>> >> > record is one line in a file. To be clear, a sample record from
>> >> > the data set is shown below.
>> >> >
>> >> >
>> >> > One line (record) looks like:
>> >> >
>> >> >   1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
>> >> >
>> >> > The field marked with asterisks is called "position of match",
>> >> > and the query we are interested in is the number of sequences
>> >> > whose "position of match" falls within a certain range. For
>> >> > instance, the range can be "position of match" > 200 and
>> >> > "position of match" + 36 < 200,000.
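Spelled out as a predicate (a sketch; `READ_LENGTH = 36` is inferred from the example bounds and is an assumption):

```python
# Sketch of the range predicate from the example. READ_LENGTH = 36 is an
# assumption (the "+ 36" in the example suggests 36-base reads); the
# bounds lo and hi come from the example in the post.

READ_LENGTH = 36

def in_range(position_of_match, lo=200, hi=200_000):
    # The read must start after lo and end before hi.
    return position_of_match > lo and position_of_match + READ_LENGTH < hi
```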
>> >> >
>> >> > Any suggestions on which Hadoop product I should start with to
>> >> > accomplish the task? HBase, Pig, Hive, or ...?
>> >> >
>> >> > Thanks!
>> >> >
>> >> > Xueling
>> >> >
>> >>
>> >
>>
>
>