
Today I attended the first day of the 5th TCE conference; this year's topic was “Scaling Systems for Big Data”. There were some nice lectures, especially the first one, which was the best of the day.

This lecture was from the Software Reliability Lab, a research group in the Department of Computer Science at ETH Zurich led by Prof. Martin Vechev, who presented the lecture. The topic was “Machine Learning for Programming”: machine learning is applied to open source repositories (GitHub and the like) to build statistical models for things that were once “science fiction”, such as code completion (not a single word or method name, but a whole chunk of code inside a method), de-obfuscation (given obfuscated code, you get back readable code with meaningful variable names and types) and others…. This is a very interesting use of machine learning, and perhaps soon we (developers) may be obsolete 🙂
Some tools using this technique: http://jsnice.org, which shows de-obfuscation of JavaScript code, and the http://nice2predict.org framework, on top of which jsnice is built.

A few facts from a short Google talk on building scalable cloud storage:

The corpus size is growing exponentially (nothing really new here)

Systems (“cloud storage systems”) require a major redesign every 5 years. That's the interesting fact… Let's remember that Google had GFS (the Google File System, of which HDFS is an open-source implementation), then moved to Colossus (in 2012). So according to that, should we see a new file system in 2017? If so, they are certainly working on it already….

If you are interested in mining and checking MS Excel files for errors and suspicious values (values that might be the result of human error), then checkcell.org might be the solution for you. What about surveys? Can surveys have errors too? Well, it seems that the same questions presented in a different order will produce different results (humans are sometimes really illogical), so if you have a survey and want to check whether you introduced some bias by mistake, then surveyman is the answer. You can refer to the pages of Emery Berger (who gave the talk) for checkcell and surveyman (http://emeryberger.com/research/checkcell/ and http://emeryberger.com/research/surveyman/ respectively).

Another nice talk came from Lorenzo Alvisi (UT Austin), about SALT. SALT combines the ACID and BASE principles (in chemistry, acid + base = salt) in a distributed database, so you can scale a system and still use relational database concepts instead of moving to a pure BASE database, which increases system complexity. The idea is to break relational transactions into new transaction types with finer granularity and better scalability. The full paper can be found here: https://www.cs.utexas.edu/lasr/paper.php?uid=84

By the way, if you are using MapReduce, here is an interesting fact from another talk, by Bianca Schroeder from the University of Toronto (this is still early-stage work): long-running jobs tend to fail more often than short ones, and retrying the execution more than twice is just a waste of cluster resources because the job will almost certainly fail again. Using machine learning, the research team is able to predict, after 5 minutes of run time, the probability that the job will fail. The observations were made on a Google cluster and on open clusters too. This will surely make a nice future paper…

The Hadoop ecosystem contains a lot of sub-projects. HBase and Pig are just two of them.

HBase is the Hadoop database; it lets you manage your data as tables rather than as files.

Pig is a scripting language that generates MapReduce jobs on the fly to get the data you need. It is very compact compared to hand-written MapReduce jobs.

One of the nice things about Pig and HBase is that they can be integrated, thanks to a recently committed patch.

The documentation is not up to date yet (what exists mostly relates to the patch itself). Some information can be found in a few blog posts, but they all lack detailed explanations. Even the Cloudera distribution CDH3 indicates support for this integration, but no sample can be found.

Below I describe the installation and configuration steps needed to make the integration work, provide an example and finally point out some of the limits of the current release (0.8).

First, install the MapReduce components (JobTracker and TaskTracker). You need one JobTracker and as many TaskTrackers as you have data nodes. Each distribution may provide a different installation procedure; I'm using the Cloudera CDH3 distribution, whose MapReduce installation is well documented.

Now proceed with the Pig installation. It is also easy, as long as you are not trying to integrate with HBase. You only need to install Pig on the client side: not on each data node, nor on the name node, but just on the machine where you want to run the Pig program.

Check your installation by entering the grunt shell (just enter ‘pig’ from the shell).

Now the tricky part: in order to use the Pig/HBase integration you need to make the MapReduce jobs aware of the HBase classes, otherwise you will get a “ClassNotFoundException” or, worse, a ZooKeeper exception like “org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase” during execution. The easy way to do this, without copying the HBase configuration into your Hadoop configuration directory, is to use hadoop-env.sh and let hbase print its own classpath.
So add a line to your hadoop-env.sh file that puts the output of the hbase classpath command on the Hadoop classpath.
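
A minimal sketch of such a hadoop-env.sh fragment. It assumes the hbase launcher script is on the PATH and supports the classpath sub-command, and that HADOOP_CLASSPATH is the variable your Hadoop scripts read; the guard around the export is an addition for safety, not something the integration requires.

```shell
# Fragment for hadoop-env.sh (a sketch, adjust to your distribution).
# "hbase classpath" prints the classpath that HBase itself runs with,
# which normally includes the HBase jars and its configuration directory.
if command -v hbase >/dev/null 2>&1; then
    # Prepend the HBase classpath; keep any existing value after it.
    export HADOOP_CLASSPATH="$(hbase classpath)${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
fi
```

The file is read when the Hadoop daemons start, so the TaskTrackers most likely need a restart before the change takes effect.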

You will also need Pig to be aware of the HBase configuration. For this you can use the HBASE_CONF_DIR environment variable (for the CDH release), which defaults to /etc/hbase/conf.
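
If your HBase configuration lives somewhere else, or you prefer to be explicit, the variable can be set in the shell that launches Pig. A small sketch, assuming the CDH default path mentioned above:

```shell
# Point Pig at the HBase client configuration (the directory holding hbase-site.xml).
# /etc/hbase/conf is the CDH default; change it if your layout differs.
export HBASE_CONF_DIR=/etc/hbase/conf
```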

OK, your installation should be fine now, so let's do an example…. For this example, let's assume we have stored in HBase a table named TestTable with a column family named A, that it holds several fields named field0, field1,…, and that we want to extract this information and store it into ‘results/extract’. In this case the Pig script consists of a LOAD from HBase using HBaseStorage, followed by a STORE.
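
A minimal sketch of such a script, written to a file from the shell. It assumes the HBaseStorage loader class shipped with Pig 0.8 (org.apache.pig.backend.hadoop.hbase.HBaseStorage), the hbase:// location prefix and PigStorage for the output; the file name extract.pig is only for illustration.

```shell
# Write the Pig script to a file (quoted heredoc, so nothing is expanded by the shell).
cat > extract.pig <<'EOF'
-- Load two columns of family A from TestTable; -loadKey adds the row key as first field.
my_data = LOAD 'hbase://TestTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey')
    AS (id, field0, field1);
-- Store the relation as text, one record per line, fields separated by semicolons.
STORE my_data INTO 'results/extract' USING PigStorage(';');
EOF

# Run it against the cluster (needs the Pig and HBase setup described above):
# pig extract.pig
```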

So the script indicates that the my_data relation will contain the fields “field0, field1” and the ID (due to the -loadKey parameter). These fields will be stored as id, field0, field1 under the ‘results/extract’ folder, and the values will be separated by semicolons.

You can also use comparison operators on the key. The operators currently supported are lt, lte, gt and gte, for less than, less than or equal, greater than, and greater than or equal.

Note: There is no support for logical operators; if you use more than one comparison operator, they are chained as AND.
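
A sketch of a key-range extraction using two chained operators. It assumes the operators are passed in the same option string as -loadKey, each followed by its bound; the row keys row100 and row200, the output folder and the file name extract_range.pig are hypothetical.

```shell
# Write a Pig script that only loads rows with row100 <= key < row200.
cat > extract_range.pig <<'EOF'
-- Both bounds apply at once (chained as AND); keys are compared as raw bytes.
my_data = LOAD 'hbase://TestTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey -gte row100 -lt row200')
    AS (id, field0, field1);
STORE my_data INTO 'results/extract_range' USING PigStorage(';');
EOF

# Run it the same way as any other script:
# pig extract_range.pig
```

Since the comparison is done on bytes, numeric keys only sort as expected if they are zero-padded to the same length.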

Limitations:

The current HBaseStorage does not allow the use of wildcards; that is, if you need all the fields in a row, you have to enumerate them. Wildcards are supported in version 0.9.

You can also use HBaseStorage to store records back into HBase; nevertheless, its behavior there is inconsistent, and a bug has already been opened for this.