I've started using SequenceFiles more and more (in particular the elephant bird load and storage functions) and am wondering what the best approach is for marshaling between a schema from pig (which can have some arbitrary number of fields) and a sequence file (which must have two fields: key and value).

I can see two options...
1) A simple Writable converter to convert to something like f1 and a composite f2, f3 field
2) Packing the fields myself using something like "a = foreach a generate f1, TOTUPLE(f2, f3)"
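
A minimal sketch of the packing idea in option 2, for a flat all-chararray schema. It swaps TOTUPLE for a tab-delimited string, on the assumption that TextConverter only needs to see chararray values; whether any converter can serialise a tuple directly is not established in this thread. The loader class name and its constructor arguments are assumed to mirror SequenceFileStorage, and the aliases packed, b and c are made up for illustration.

```pig
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
-- Assumption: the loader takes the same '-c <converter>' key/value arguments as the storage class.
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);

-- Pack: f1 becomes the key, f2 and f3 are joined into one tab-delimited value.
-- CONCAT yields null if either argument is null, so nulls need handling upstream.
packed = foreach a generate f1 as key, CONCAT(f2, CONCAT('\t', f3)) as value;
store packed into 'y' using $SEQFILE_STORAGE('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');

-- Unpack on reread: split the value back into its two fields.
b = load 'y' using $SEQFILE_LOADER('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER')
    as (key:chararray, value:chararray);
c = foreach b generate key as f1,
    FLATTEN(STRSPLIT(value, '\t', 2)) as (f2:chararray, f3:chararray);
```

This only holds up if f2 never contains a tab, and the field order still lives in the script rather than in the file, which is the schema-carrying problem in miniature.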

We tend to write protobuf or thrift definitions for complex objects, but that introduces severe latency into the development process.

I suppose you could try something like kryo (and create a corresponding deserializer for EB). The core of the problem is that you need to carry around the schema, and you probably don't want to write it into every tuple.

D

On Sat, Sep 15, 2012 at 5:15 PM, Mat Kelcey <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I've started using SequenceFiles more and more (in particular the
> elephant bird load and storage functions) and am wondering what the
> best approach is for marshaling between a schema from pig (which can
> have some arbitrary number of fields) and a sequence file (which must
> have two fields: key and value).
>
> When I've got two fields it's trivial...
>
> %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
> %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
> a = load 'x' as (f1:chararray, f2:chararray);
> store a into 'y' using $SEQFILE_STORAGE('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');
>
> but what's the best way to handle something with 3+ fields?
>
> a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
>
> I can see two options...
> 1) A simple Writable converter to convert to something like f1 and a
> composite f2, f3 field
> 2) Packing the fields myself using something like "a = foreach a
> generate f1, TOTUPLE(f2, f3)"
>
> But both are super clumsy and require unpacking when I reread things.
>
> Am I missing something obvious here?
>
> Cheers,
> Mat

I guess I was looking for a quick win for a simple flat schema; a serialisation format feels like a bit of overkill for what I'm doing.

I might be able to just JSON my way out of this specific problem...

Cheers!
Mat

On 15 September 2012 19:44, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> We tend to write protobuf or thrift definitions for complex objects,
> but that introduces severe latency into the development process.
>
> I suppose you could try something like kryo (and create a
> corresponding deserializer for EB). The core of the problem is that
> you need to carry around the schema, and you probably don't want to
> write it into every tuple.
>
> D
>
> On Sat, Sep 15, 2012 at 5:15 PM, Mat Kelcey <[EMAIL PROTECTED]> wrote:
>> Hey all,
>>
>> I've started using SequenceFiles more and more (in particular the
>> elephant bird load and storage functions) and am wondering what the
>> best approach is for marshaling between a schema from pig (which can
>> have some arbitrary number of fields) and a sequence file (which must
>> have two fields: key and value).
>>
>> When I've got two fields it's trivial...
>>
>> %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
>> %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
>> %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
>> a = load 'x' as (f1:chararray, f2:chararray);
>> store a into 'y' using $SEQFILE_STORAGE('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER');
>>
>> but what's the best way to handle something with 3+ fields?
>>
>> a = load 'x' as (f1:chararray, f2:chararray, f3:chararray);
>>
>> I can see two options...
>> 1) A simple Writable converter to convert to something like f1 and a
>> composite f2, f3 field
>> 2) Packing the fields myself using something like "a = foreach a
>> generate f1, TOTUPLE(f2, f3)"
>>
>> But both are super clumsy and require unpacking when I reread things.
>>
>> Am I missing something obvious here?
>>
>> Cheers,
>> Mat


Mat Kelcey 2012-09-16, 03:26
