Submitting a Map-Reduce Job using CassandraJobConf

Apologies in advance if this seems a bit 101, but I'm struggling to submit a Map-Reduce job to the Brisk-managed JobTracker from code. The class that represents the MR job extends Configured and implements Tool. In its run method, I pass a new CassandraJobConf to the Job constructor (see code below). The job runs, but it seems to spawn its own Hadoop framework to process the job rather than handing the job off to the Brisk-managed JobTracker (it also doesn't show up in the JobTracker or TaskTracker web apps). I haven't seen an example using CassandraJobConf, so I'm not sure whether this is the intended usage. Any thoughts would be greatly appreciated.
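
(The original snippet isn't preserved here; the following is a sketch of the pattern described above, with placeholder job and class names, not the original code.)

    import org.apache.cassandra.hadoop.trackers.CassandraJobConf;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class DependencyGraphJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Pass a CassandraJobConf to the Job constructor, with the intent
            // that the job picks up the Brisk-managed JobTracker location.
            Job job = new Job(new CassandraJobConf(), "dependency-graph");
            job.setJarByClass(DependencyGraphJob.class);
            // ... set mapper/reducer classes and input/output formats here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new DependencyGraphJob(), args));
        }
    }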

Thanks for responding so quickly! After swapping out the CassandraJobConf for the deprecated JobConf, we still cannot see the job in the JobTracker or TaskTracker sites. The log output is posted below. We only have a single node configured at the moment in our test environment. We can start the node with or without Hadoop enabled and the job still runs as seen below. Anything else you might suggest?

If I manually set the mapreduce.jobtracker.address property on the configuration (e.g., conf.set("mapreduce.jobtracker.address", CassandraJobConf.getJobTrackerNode().getHostName() + ":8012")), then an exception is thrown indicating that it can't find cassandra.yaml (see below).

05-07-2011 17:07:29 [http-8080-1] ERROR config.DatabaseDescriptor - Fatal configuration error
org.apache.cassandra.config.ConfigurationException: Cannot locate cassandra.yaml
at org.apache.cassandra.config.DatabaseDescriptor.getStorageConfigURL(DatabaseDescriptor.java:111)
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:121)
at org.apache.cassandra.db.ColumnFamily.getComparatorFor(ColumnFamily.java:397)
at org.apache.cassandra.db.ReadCommand.getComparator(ReadCommand.java:94)
at org.apache.cassandra.db.SliceByNamesReadCommand.<init>(SliceByNamesReadCommand.java:44)
at org.apache.cassandra.db.SliceByNamesReadCommand.<init>(SliceByNamesReadCommand.java:38)
at org.apache.cassandra.hadoop.trackers.TrackerManager.getCurrentJobtrackerLocation(TrackerManager.java:51)
at org.apache.cassandra.hadoop.trackers.CassandraJobConf.getJobTrackerNode(CassandraJobConf.java:62)
at ntoklo.matrix.impl.computation.hadoop.DependencyGraph.run(DependencyGraph.java:77)
at ntoklo.matrix.impl.controller.DataController.computeGraphs(DataController.java:206)
at ntoklo.matrix.impl.MatrixImpl.serviceComputeGraphs(MatrixImpl.java:89)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:167)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:70)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:279)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:136)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:86)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:136)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:74)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1347)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1279)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1229)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1219)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:419)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
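
For context, DatabaseDescriptor in this era of Cassandra resolves cassandra.yaml from the cassandra.config system property first (as a URL), and only then falls back to loading it as a classpath resource. So one possible workaround sketch, not from the original post and with an illustrative path, would be:

    public class CassandraYamlBootstrap {
        public static void configure() {
            // Point DatabaseDescriptor at the Brisk cassandra.yaml explicitly.
            // The value must be a URL; the path below is a placeholder for
            // wherever Brisk installed the file.
            System.setProperty("cassandra.config",
                               "file:///etc/brisk/cassandra/cassandra.yaml");
        }
    }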

If I make the Brisk cassandra.yaml available on the web app classpath, then I get the following error:

It is executed via a call to a RESTful endpoint in a web app. We just create a new instance of the class that extends Tool and execute its run method. Are we required to use the command line to schedule MR jobs in Brisk?
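
For illustration, the invocation pattern described is roughly the following (a sketch only; the JAX-RS resource and class names are placeholders, though Jersey does appear in the stack trace above):

    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;

    @Path("/graphs")
    public class GraphResource {
        @POST
        public Response computeGraphs() throws Exception {
            // Run the Tool in-process. With a default Configuration, the job
            // inherits the web app's (local) Hadoop setup unless overridden.
            int rc = ToolRunner.run(new Configuration(),
                                    new DependencyGraphJob(), new String[0]);
            return Response.ok(String.valueOf(rc)).build();
        }
    }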

You need to know ahead of time what the JobTracker address is. Assuming you have run "brisktool jobtracker" and have the location, you must set it as the value of the mapred.job.tracker property in your configuration...
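
A sketch of what that might look like (the host is a placeholder; 8012 is the port mentioned earlier in the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitToBrisk {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            // Address reported by "brisktool jobtracker"; host is illustrative.
            conf.set("mapred.job.tracker", "10.11.12.13:8012");
            return new Job(conf, "dependency-graph");
        }
    }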

Is the version of Hadoop you are using in your service the same as Brisk's (you should be using the Brisk jars)? Also, you need to use the core-site.xml in your config; I don't see cfs:// anywhere in your config log statements.
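
One way to pull that in from code (a sketch; the path is a placeholder for wherever the Brisk core-site.xml lives):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class BriskConfiguration {
        public static Configuration load() {
            Configuration conf = new Configuration();
            // Load the core-site.xml shipped with Brisk so properties such as
            // the cfs:// default filesystem are picked up; path is illustrative.
            conf.addResource(new Path("/opt/brisk/resources/hadoop/conf/core-site.xml"));
            return conf;
        }
    }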

I don't quite understand what you are doing in terms of your service but it seems very non-standard :)

Thanks for your feedback. Let me take a step back for a second and explain what we had in place prior to our Brisk integration effort. Our product comprises three web apps (each of which is the product of multiple projects, all Maven-based). Two of the three web apps handle the OLTP side of our product, and one handles the OLAP side. About a month ago, we replaced one of our OLAP tasks with a Hadoop implementation and saw a significant performance improvement. This Hadoop task lived within the web app and was executed via the ToolRunner as the result of a RESTful request to the web app. All of this worked, albeit with limitations. So, given the promise of Brisk, we're quite keen to integrate. Unfortunately, we did not find any examples on your site that relate to the way we submitted jobs previously. When you write that our service "seems very non-standard", what do you mean specifically? What is the "standard" way to submit jobs? The command line?

I've aligned the versions of Hadoop to use 0.20.203-brisk1. As we're a Maven shop, we pushed the ivy/hadoop-core.pom.xml that we found in your beta 1 distribution when creating an artifact for the Brisk version of Hadoop; I'm not sure whether this is an accurate pom for the Brisk Hadoop core code. We also added the core-site.xml to our web app (the same one that came down with the Brisk beta 1 binary). It initially failed with a ClassNotFoundException for SnappyException, though we added that dependency to get past it. Then it complained that it couldn't find cassandra.yaml. Needless to say, it seems this is not the recommended approach. :o)
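
For what it's worth, after pushing that pom to an internal repository, the dependency entry would presumably look something like this (the groupId/artifactId are assumptions based on the stock Apache hadoop-core artifact; only the version string comes from this thread):

    <!-- Assumed coordinates; verify against the pom shipped with Brisk. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203-brisk1</version>
    </dependency>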