This forum is now a read-only archive. All commenting, posting, registration services have been turned off. Those needing community support and/or wanting to ask questions should refer to the Tag/Forum map, and to http://spring.io/questions for a curated list of stackoverflow tags that Pivotal engineers, and the community, monitor.

starting up hive within my spring data app

Feb 11th, 2013, 08:25 AM

I have a script that runs fine when I connect remotely to my Hive server, but fails with the very generic "Query returned non-zero code: 1, cause: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask, errorCode:1"
when I run the Hive server as part of my Spring application. I assume that I need to have it start up with the same parameters or the like. Has anyone run into this? I have hive-site.xml on my classpath.

hive-site.xml is needed on the server side, not the client, as most of the configuration in this case is done through Spring. Hive and Hadoop in general are fairly cryptic and there's not much SHDP can do about this.
Make sure to look into the Hive server logs as well to see whether something goes wrong on the server - not having the Derby instance running or a missing library tend to be common errors.

Just to clarify, my Spring Data app is trying to bootstrap Hive. When I run the code such that the client connects to an already running HiveServer, it works fine. When I run the script in the Hive CLI it works, but when I run it in the instance created by the XML:
<hdp:hive-server port="${hive.port}" auto-startup="true"
properties-location="hive-server.properties"/>

That is when I get the cryptic errors. So I am assuming that the defaults the Hive server starts with somehow differ from how it runs when I bring it up at the command line, and I am wondering how I can have the Spring Hive instance start exactly as it does from the command line.
My preference is to bring it up in process so we don't accidentally have multiple processes using the same Hive server, given the warnings about thread safety.

Understood. It depends on how you start your Hive server by hand - do you rely on any services or configuration? Make sure these are passed properly to hive-server.
Note that by definition, hive-server starts a Thrift server for use with hive-client (a Thrift client).
Also, make sure the hive-site.xml properties are properly passed to the hive-server - it's best to specify them through properties-location rather than as a file, since the classpath can differ. Note also that all the Hive-related libraries and dependencies need to be available on the classpath (as opposed to just hive).
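As a sketch, Hive settings can also be embedded directly in the element body alongside properties-location (the property names and values below are illustrative examples, not taken from this thread):

```xml
<!-- illustrative: inline Hive properties, equivalent to entries in hive-site.xml -->
<hdp:hive-server port="${hive.port}" auto-startup="true"
        properties-location="hive-server.properties">
    hive.exec.scratchdir=/tmp/hive
    hive.metastore.warehouse.dir=/user/hive/warehouse
</hdp:hive-server>
```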

No - the properties-location attribute points to just that, a properties file.
You can however declare a dedicated hdp:configuration element and pass the XML to that (and potentially set any other properties that you want in a nested fashion): http://static.springsource.org/sprin...#hadoop:config
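For example, a minimal sketch (the resource file names and the inline property are assumptions for illustration, not from this thread):

```xml
<!-- illustrative: register extra resource files with the Hadoop configuration
     and set additional properties inline -->
<hdp:configuration resources="classpath:/core-site.xml, classpath:/hive-site.xml">
    fs.default.name=${hd.fs}
</hdp:configuration>
```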

I've just double-checked one of the Hive server tests and there's nothing special about my setup. Note that I'm taking the hard road - using the Hive client on a Windows machine (my dev one), accessing a Hive server on a Windows machine (the same one), talking to a Hadoop cluster on a remote/VM machine (*nix).

First make sure the Hive libraries are in place - you typically get a CNFE if you don't. In my case this meant adding hive-builtins/hive-metastore to the classpath. Also make sure you're using a proper version of antlr (antlr-runtime 3.0.x) - this can be an issue if you have Pig on the classpath, which will pull in a more recent version of antlr with which Hive is not compatible (and you'll then get a cryptic NoSuchFieldError).
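If the build is Maven-based, one way to keep Pig's antlr out of the picture is an exclusion on the Pig dependency (the coordinates and version here are assumptions for illustration):

```xml
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.10.0</version>
    <exclusions>
        <!-- Pig pulls in a newer antlr that clashes with Hive's antlr-runtime 3.0.x -->
        <exclusion>
            <groupId>org.antlr</groupId>
            <artifactId>antlr-runtime</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```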

By the way, the properties passed to hive-server are just for testing - you can safely ignore them. Only hadoopConfiguration is relevant (and that is passed by default anyway). Note that I don't have any hive-site.xml on the classpath.
In fact you can see the test (both the server and the client that runs against it) in the project test suite: https://github.com/SpringSource/spri...hive/basic.xml

Thanks. My environment is running on a configured Cloudera Hadoop cluster (CentOS). Hive does come up, but the job ultimately fails, even though this same job succeeds when running Hive out of process. There is a CNFE that seems to be a red herring, since the class is there. I built the classpath of my app by including /usr/lib/hadoop, /usr/lib/hadoop/lib and /usr/lib/hive/lib, and I can see the server comes up, so the jars are there - including the one with the class that is reported as not found. In stdout I see:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.exec.ExecDriver

I will try later in the day to test against CDH - maybe they have a different setup than vanilla Hive (I've tested against the official vanilla Hive 0.8, 0.9 and 0.10). What version are you using?

The CNFE might be more than a red herring - typically if a CNFE is raised, the job won't fail right away but rather after it times out. This looks like your case as well. Can you enable more logging for Hive (note you have to do this inside your app) and also check the Hadoop jobs, just in case?

Thanks a lot for all your help!
We are using CDH 4.1.2.
I have tried to figure out how to add more logging - beyond the standard Hive log4j settings. That is, how can I start up Hive in the app context in verbose mode? I can change Hive's log4j configuration, but I'm not sure Hive bootstrapped from within my Spring app is reading it.
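A sketch of what raising the Hive log level might look like - assuming the in-process server picks up the application's own log4j.properties (since it runs on the app classpath rather than reading $HIVE_HOME/conf; the exact logger names are an assumption):

```properties
# illustrative log4j.properties entries in the Spring application itself
log4j.logger.org.apache.hadoop.hive=DEBUG
log4j.logger.org.apache.hadoop.hive.ql.exec=DEBUG
```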

As for checking the Hadoop jobs, I am a newbie to Hadoop - I have been using Hive with great success, but I'm not sure how to debug there. What I do know, as I have stressed, is that the Hive query works.

I've tried the same scenario (Hive server started on a Windows machine, the Hive test running against it on the same Windows machine, Hadoop cluster running remotely).
In all cases, due to the change of version, I had to nuke metastore_db (and change the hostname of the Hadoop VM, but that's because I'm testing different distros).
CDH3u5 worked without any issues.
I then tested using CDH 4.1.3 on the client with CDH 4.1.1 running inside the Hadoop VM - Hive threw some errors in the logs, but those proved not to be harmful (they're mainly about indexes).
The full log (plus some gradle stuff) is available here: https://gist.github.com/costin/f4c0745eae071cb5214d

Note that as opposed to CDH3 and vanilla Hadoop, I had to run this from a Cygwin environment (as I'm using Windows).
Additionally, I had to specify the Hadoop VM by hostname, not by IP - while CDH3 and vanilla Hive allow this, CDH4 does not, and one gets an exception soon after.
The test is fairly comprehensive - it creates tables, adds some data and then runs some queries - and all passed.