Getting Cascading to Read Sequence Files Created Somewhere Else

Sometimes you can’t control where your data comes from or how it’s formatted. For instance, where I work a lot of data is stored in SequenceFiles. Unfortunately, the files don't take advantage of the typing SequenceFiles provide; instead, each record is a single field containing a delimited string.
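To make that record shape concrete, here is a minimal sketch of splitting one such value into its fields with plain Java. The `DelimitedRecord` helper, the tab delimiter, and the sample values are all assumptions for illustration; the real delimiter depends on whoever wrote the files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cascading or Hadoop.
class DelimitedRecord {
    // Split one record value into its fields.
    // Pattern.quote treats the delimiter literally (so "|" is safe), and the
    // -1 limit keeps trailing empty fields instead of silently dropping them.
    static List<String> split(String record, String delimiter) {
        return Arrays.asList(record.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Assumed sample record: three tab-separated fields.
        System.out.println(split("42\tnate\t2009", "\t"));
    }
}
```

Keeping trailing empty fields matters when downstream code expects a fixed number of columns per record.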

I like to use Cascading (or Cascalog) for my Hadoop jobs, but out of the box Cascading doesn’t support SequenceFiles that were created outside of Cascading. That is to say, Cascading requires that your SequenceFile values be instances of Tuple.

The solution is to create your own Scheme that parses a SequenceFile according to your own format. In my case I just want each record's value wrapped in a single-field Tuple holding the string.

The code is simple but may not be obvious for a first-time Cascading user. I hope this will save someone a few minutes.

package com.xcombinator;

import java.io.IOException;

import cascading.tap.Tap;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.tuple.Tuples;
import cascading.scheme.SequenceFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

/**
 * A SequenceFileAsText is a type of {@link SequenceFile}, however the
 * SequenceFile has been created outside of Cascading and is assumed to have a
 * value of a string.
 */
public class SequenceFileAsText extends SequenceFile
{
  /** Field serialVersionUID */
  private static final long serialVersionUID = 1L;

  /** Protected for use by TempDfs and other subclasses. Not for general consumption. */
  protected SequenceFileAsText()
  {
    super( null );
  }

  /**
   * Creates a new SequenceFileAsText instance that stores the given field names.
   *
   * @param fields
   */
  public SequenceFileAsText( Fields fields )
  {
    super( fields );
  }

  @Override
  public Tuple source( Object key, Object value )
  {
    if( value instanceof Tuple )
    {
      return (Tuple) value;
    }
    else if( value instanceof Comparable )
    {
      return new Tuple( (Comparable) value );
    }
    else if( value != null )
    {
      return new Tuple( String.valueOf( value ) );
    }
    else
    {
      return new Tuple( (Comparable) null );
    }
  }
}
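If you want to check what each branch of source() produces without putting Cascading or Hadoop on the classpath, the branching can be mirrored in plain Java. This is only a sketch: `SourceSketch` and `toTuple` are hypothetical names, and a `List<Object>` stands in for Cascading's Tuple.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the scheme's source() logic.
// List<Object> plays the role of Cascading's Tuple here.
class SourceSketch {
    static List<Object> toTuple(Object value) {
        if (value instanceof List) {
            // Already tuple-shaped: keep the elements (copied into a new list).
            return new ArrayList<Object>((List<?>) value);
        } else if (value instanceof Comparable) {
            // Comparable values are wrapped unchanged in a one-element tuple.
            return Collections.<Object>singletonList(value);
        } else if (value != null) {
            // Anything else is converted to its string form first.
            return Collections.<Object>singletonList(String.valueOf(value));
        } else {
            // A null value still yields a one-element tuple holding null.
            return Collections.<Object>singletonList(null);
        }
    }

    public static void main(String[] args) {
        System.out.println(toTuple("a,b,c"));              // Comparable branch
        System.out.println(toTuple(new AtomicInteger(7))); // non-Comparable branch
        System.out.println(toTuple(null));                 // null branch
    }
}
```

Note that only the non-Comparable branch converts the value to a string; Comparable values are passed through as the object they already are.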

Why, Hello!

I'm Nate Murray and this is a blog I've been writing since 2007.
I work at IFTTT and I've been working with big data since 2009. My work involves large-scale data mining, distributed computing, iOS & web apps. If you like this blog then you should follow me on Twitter (@eigenjoy).