Archive

Today I attended the first day of the 5th TCE conference; this year's topic was “Scaling Systems for Big Data”. There were some nice lectures, and the first one was the best of the day.

This lecture came from the Software Reliability Lab, a research group in the Department of Computer Science at ETH Zurich led by Prof. Martin Vechev, who presented it. The topic was “Machine Learning for Programming”, where machine learning is applied to open source repositories (GitHub and the like) to build statistical models for things that were once “science fiction”, like code completion (not a single word or method call, but a whole block of code for a method), de-obfuscation (given obfuscated code, you get back nicely readable code with meaningful variable names and types) and more…. This is a very interesting use of machine learning, and perhaps soon we (developers) may be obsolete 🙂
Some tools using this technique: http://jsnice.org, which de-obfuscates JavaScript code, and http://nice2predict.org, the framework on top of which JSNice is built.

A few facts from a short Google talk on building scalable cloud storage:

The corpus size is growing exponentially (nothing really new here)

Systems (“cloud storage systems”) require a major redesign every 5 years. That's the interesting fact… Remember that Google had GFS (the Google File System, after which HDFS was modeled), then moved to Colossus (in 2012), so by that rule we should see a new file system in 2017. If so, they are certainly working on it already….

If you are interested in mining MS Excel files for errors and suspicious values (indicating that some values might be human mistakes), then checkcell.org might be the solution for you. What about surveys? Can surveys have errors too? Well, it seems the same questions presented in a different order will produce different results (humans are sometimes really not logical), so if you have a survey and want to check whether you inserted some bias by mistake, then SurveyMan is the answer. You can refer to Emery Berger's (who gave the talk) pages for CheckCell and SurveyMan (http://emeryberger.com/research/checkcell/ and http://emeryberger.com/research/surveyman/ respectively).

Another nice talk, from Lorenzo Alvisi (UT Austin), was about Salt: a combination of the ACID and BASE principles in a distributed database (in chemistry, acid + base = salt). So you can scale a system and still use relational database concepts instead of moving to a pure BASE database, which increases system complexity. The idea is to break relational transactions into new transaction types with better granularity and scalability. The full paper can be found here: https://www.cs.utexas.edu/lasr/paper.php?uid=84

By the way, if you are using MapReduce, an interesting fact from another talk, by Bianca Schroeder from the University of Toronto (this is an early-stage paper): long-running jobs tend to fail more often than short ones, and retrying the execution more than twice is just a waste of cluster resources because it will almost surely fail again. Using machine learning, the research team is able to predict, after 5 minutes of a job's run, its probability of failure. The observations were made on a Google cluster and on open clusters too. This will surely be a nice future paper…

Cloudera offers a nice solution if you want to play with the Hadoop ecosystem (and other Cloudera add-ons): a virtualized single-node Hadoop cluster. The VM is available for VMware, KVM and VirtualBox and can be downloaded from the Cloudera download site.

Lately I faced the problem that the VM's predefined disk size (25GB) was not enough and I needed to increase it. Something that sounds trivial cost me several hours to figure out (especially when you are not a Linux admin, and when the graphical user interface of the virtualized guest OS is missing some system tools).

So below are the instructions. I'll show how to increase the disk size from 25GB to 100GB on the VMware image (using VMware Player) of CDH 4.x.

In VMware Player, with the Cloudera image shut down, go to the VM settings, Hardware tab, select the Hard Disk device, then the Utilities button and the Expand option. In the dialog set the new size (here 100GB) and press OK.
Once completed (this operation takes several minutes, depending on the disk size), a popup will inform you that the virtual disk was enlarged but that you need to modify the guest OS to use the new size. To do so we will perform several admin operations in the guest OS, which in our case is a CentOS Linux distribution.

The next step is to boot the guest OS without any services and without the graphical UI; we will in fact start it at run level 1 (single-user mode). Start the VM, and at the boot screen press any key to enter the GRUB boot manager (you have 3 seconds to do so). When the GRUB menu shows, go to the entry for the Cloudera demo VM, press ‘e‘ to edit the entry, go to the kernel line and press ‘e‘ again, add ‘1‘ at the end of the line, then press Enter and ‘b‘ to boot with the modified options.

Once booted, log in as root (the password is cloudera) and check the disk size using ‘df -h’: you will see that the file system is still 25GB. ‘fdisk -l’ shows the physical disk and its partitions; you can see that the physical disk (/dev/sda) already reflects the increased size, but the partition (/dev/sda1) does not.
We will change this using the fdisk and resize2fs commands.

So at the prompt type ‘fdisk /dev/sda‘. We will delete and recreate the partition; pay attention that the newly created partition needs to start at the same offset as the one we delete. To note its starting point, at the fdisk prompt press ‘p‘. In my case /dev/sda1 started at cylinder 1.

At the fdisk prompt press ‘d‘ to delete the partition; since there is only one partition, it is selected automatically.

Press ‘n‘ to create a new partition, then ‘p‘ for a primary partition, then ‘1‘ for the partition number. Enter your previous starting number (or press Enter to pick the default), then enter the end size (or press Enter to pick the default maximum, here 100GB).

At this stage the new partition /dev/sda1 should have been created.

Type ‘w‘ to write the partition table changes to disk; fdisk then exits.

(If you need to abort without saving the changes, use the ‘q‘ command instead.)

Reboot (you can use the ‘reboot‘ command for that) and at the start screen proceed as in step #2 to boot at run level 1, then log in as root again. Now ‘df -h’ will still show a 25GB file system, but ‘fdisk -l‘ will show that the new partition has a size of 100GB.

To resolve this, at the command prompt type ‘resize2fs /dev/sda1‘. Once the resize ends, the increased disk space should be reflected by ‘df -h‘ (here ~99GB).
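For reference, the whole session looks roughly like this (a transcript sketch, not a script to paste blindly: fdisk is interactive, the commands are destructive, and offsets/sizes will differ on your VM):

```shell
# At run level 1, logged in as root:
df -h                 # file system still reports 25GB
fdisk -l              # /dev/sda is 100GB, /dev/sda1 still 25GB
fdisk /dev/sda        # interactive: p (note the start), d, n, p, 1,
                      #   Enter to accept the defaults, then w to write
reboot                # boot to run level 1 again, as in step #2
resize2fs /dev/sda1   # grow the file system to fill the partition
df -h                 # now ~99GB
```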

That’s it. You can now reboot as usual and enjoy your increased disk space.

The Hadoop ecosystem contains a lot of subprojects; HBase and Pig are just two of them.

HBase is the Hadoop database, allowing you to manage your data as tables rather than as files.

Pig is a scripting language that generates MapReduce jobs on the fly to get the data you need. It is very compact compared to hand-writing MapReduce jobs.

One of the nice things about Pig and HBase is that they can be integrated, thanks to a recently committed patch.

The documentation is not up to date yet (currently it mostly relates to the patch itself); some can be found in posts like here, but they all lack detailed explanations. Even the Cloudera distribution CDH3 indicates support for this integration, but no sample can be found.

Below I describe the installation and configuration steps to make the integration work, provide an example, and finally expose some limits of the current release (0.8).

First, install the MapReduce components (Job Tracker and Task Tracker): one Job Tracker, and as many Task Trackers as you have Data Nodes. Each distribution may provide a different installation procedure; I'm using the Cloudera CDH3 distribution, whose MapReduce installation is well documented.

Now proceed with the Pig installation; it is also easy as long as you are not trying the HBase integration. You only need to install Pig on the client side: not on each Data Node, nor on the Name Node, but just on the machine from which you want to run the Pig program.

Check your installation by entering the Grunt shell (just type ‘pig’ from the shell).

Now the tricky part – in order to use the Pig/HBase integration, you need to make the MapReduce jobs aware of the HBase classes, otherwise you will get a “ClassNotFoundException”, or worse, a ZooKeeper exception like “org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase” during execution. The way to do this easily, without copying the HBase configuration into your Hadoop configuration dir, is to use hadoop-env.sh and let hbase print its own classpath.
So add the following to your hadoop-env.sh file:
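A minimal sketch of such an addition, assuming the hbase launcher is on the PATH (its classpath sub-command prints HBase's own classpath):

```shell
# hadoop-env.sh: prepend HBase's classpath so MapReduce tasks can load
# the HBase and ZooKeeper classes.
export HADOOP_CLASSPATH="$(hbase classpath):$HADOOP_CLASSPATH"
```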

You will also need Pig to be aware of the HBase configuration; for this you can use the HBASE_CONF_DIR environment variable (for the CDH release), which is configured by default to be /etc/hbase/conf.

OK, your installation should be fine now, so let's do an example…. Let's assume we have stored in HBase a table named TestTable with a column family named A, containing several fields named field0, field1, …, and we want to extract this information and store it under ‘results/extract’. In this case the Pig script will look like:
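A sketch of such a script, using the names above (the exact HBaseStorage package and option syntax may vary slightly between releases):

```pig
-- Load field0 and field1 from column family A of TestTable;
-- -loadKey also emits the row key as the first field.
my_data = LOAD 'hbase://TestTable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'A:field0 A:field1', '-loadKey')
          AS (id, field0, field1);

-- Store as id;field0;field1 under results/extract.
STORE my_data INTO 'results/extract' USING PigStorage(';');
```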

So the above script indicates that the my_data relation will contain the fields field0 and field1 plus the row ID (due to the -loadKey parameter). These fields will be stored as id, field0, field1 under the ‘results/extract’ folder, with values separated by semicolons.

You can also use comparison operators on the key. The operators currently supported are lt, lte, gt, gte for lower than, lower than or equal, greater than and greater than or equal.

Note: there is no support for logical operators, but you can use more than one comparison operator; they are chained as AND.
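For instance, a sketch of a key-filtered load (hypothetical key values; option spellings as listed above):

```pig
-- Only rows whose key is greater than 'row05' AND lower than or
-- equal to 'row20'.
subset = LOAD 'hbase://TestTable'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
             'A:field0', '-loadKey -gt row05 -lte row20');
```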

Limitations:

The current HBaseStorage does not allow the usage of wildcards; that is, if you need all the fields in a row, you have to enumerate them. Wildcards are supported in version 0.9.

You can use HBaseStorage to store records back into HBase; nevertheless this HBase usage is inconsistent, and a bug has already been opened on it.

This year the Technion (Israel Institute of Technology) held “The 1st Technion Computer Engineering (TCE) Conference (June 2, 2011)”; I registered for the second day (Thursday, June 2). The day's topics were “Computer Architecture & Systems”, and many lecturers from both academia and industry were present. Among them I had the pleasure of listening to Leslie Lamport (from Microsoft), well known for his “Paxos algorithm” paper (see resources), and other really nice lecturers like Yale N. Patt (University of Texas at Austin).

Below are some of my impressions from this day:

Moore's Law seems to be reaching its limit – or at least it is not certain that the race to double the number of transistors every 18 months is still the primary goal of the CMOS industry (too many cores that we barely need)

Parallel programming is one of the major topics that should be taught, and we should stop putting up the “it's complicated” barrier around it

Security of cloud data seems to receive major attention from both Microsoft and Intel

Theory (mathematics) and hardware knowledge are important for achieving performant programming.

Well, I never imagined that I would write a C++ article when my main specialization is Java. Anyway, for the last three years I have been involved in a cross-discipline project involving Java and C++.

In this project a Java generator produces millions of lines of C++ code which, of course, have to be compiled, and if you are a C++ guy your hair is certainly already standing on end at the thought of how long compiling such a huge amount of code takes. Well, you are right: we faced extremely long compilation times (12+ hours on Unix; on Windows…), which is a major problem in a product that should have a quick time to market.

Worse, the product is used under both Windows and Unix, which means a solution needed to be found for both worlds.

Under Windows – even with Incredibuild from Xoreax (a great grid compilation platform, which reduced the compilation time considerably), the user still needed to wait 2 hours for a build, which was not acceptable.

Under Unix – no grid compiler (unless you work on only a few platforms); we tried distcc, but the results were still not satisfying and it requires additional hardware. We were stuck…

Therefore we began researching alternatives that could speed up the builds, and for that we needed to understand the factors affecting compilation time. Our main suspect was the number of lines of code to compile (since the code was generated, it was very easy to inflate the output), but we soon understood that we were wrong…

Below are the factors impacting compilation time (ordered by impact):

In first place, the number of files to compile – this is one of the major factors affecting compilation time. The compiler is not really smart at reusing the information it processed between invocations, it is not able to work on a set of files, and it is especially slow (I/O bound) at building dependencies. If you really want to reduce compilation time, reduce the number of files to compile; that does not mean writing everything in a single file, since you can use what is called a Unity Build.
A Unity Build groups several cpp files into a single one using just the include directive. For example, say you want to compile file1.cpp to file10.cpp; then create a new file group.cpp as follows:
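The group file contains nothing but includes (file names as in the example above):

```cpp
// group.cpp – a Unity Build file: the member translation units are
// pulled in with plain #include directives.
#include "file1.cpp"
#include "file2.cpp"
#include "file3.cpp"
// … up to …
#include "file10.cpp"
```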

Now compile group.cpp, and don't forget to add the file1.cpp…file10.cpp location to the include path. This method works miracles (of course you have to balance the number of files you put in a single group/unity). Our compilation time dropped from 15 hours to just 2.

Include paths – a large number of include paths directly affects the build time, since the compiler (or preprocessor) needs to scan all the paths until it finds the requested include. So try to minimize them, or at least order the path list with the most frequently searched ones first.

NAS (Network Attached Storage) also has a bad impact on compilation (writes are usually fast, but reads are slow, so library creation is slow).

Generate the cppdep file and compile at the same time – Unix compilers support an option to create the dependency file and compile in the same pass; this can save approximately 20% of your compilation time.

Forward declarations, and the “Pimpl idiom” built on top of them to reduce dependencies, greatly help; the problem is that you cannot always refactor the code to avoid some include that will erase your effort.

Usage of templates – using C++ templates excessively increases compilation time and library size (especially if the template is defined in a header).

Number of string constants in a single file – it might sound strange, but some compilers (HP and Sun at least) suffer a performance degradation when the compilation unit contains a large number of strings (a few thousand). Note: the Visual Studio compiler is not sensitive to this factor.

Generic vs. inflated code – that is, calling a function vs. writing the function's content wherever you need it (like a forced “inline”). Inlining functions this way may produce better runtime performance, but it does not affect compilation time as much as you would think. We effectively reduced millions of lines of code by 75% by using function calls instead of inlining their content, yet got no improvement in build time; at least you gain more maintainable/debuggable code.

Usage of pre-compiled headers might help, but in our tests they did not: the compilation time actually increased.

Usage of a header cache folder – similarly to pre-compiled headers, it should help (according to the vendors), but in our tests it usually did not.

So if you really want to reduce your compilation time, try the Unity Build concept; you will gain:

Faster build time

Smaller objects size

Smaller libraries size

Better optimized code

Note: the compilation time is related to the number of cpp files in a single Unity Build (and their dependencies), and this number should be tuned according to the included files' content (inlines, template usage, headers used…). If you put too many files in a single unity/group file, compilation time increases again (still better than not using a group file at all), while the library size keeps declining.

Jar files (JAva aRchives) are very convenient containers: you can pack everything your application needs (at least classes and resources), put the jar on the target environment and just run java -cp <myapp.jar> <appMain> <command line args> to execute your program.

With a jar file you don't need scripts or a long command line to set up your classpath for execution. But you can do even better than configuring the classpath and the main class on the command line: use the manifest file for this. Doing so, you can just type java -jar <myapp.jar> <command line args>.

The manifest is a text file (property-like) containing information about the archive. As part of this information you can define the main class of the archive and the classpath (as long as you did not pack the other jars inside it).

To do so, define the ‘Class-Path’ and ‘Main-Class’ tags in the manifest. Here is a sample:
Main-Class: sample.package.MyMain
Class-Path: directory-one/sub-directory-one/referenced.jar directory-two/

Keep in mind that:

You can specify several directories and/or referenced jars, using a space as delimiter

References to directories and other jars are relative to the jar's location

Any jar referenced via the Class-Path attribute cannot be packed inside your original archive (without a special classloader)

If you have resources in some directory, don't forget the slash at the end, otherwise the content of the directory is not seen.

Ant is a powerful build and scripting tool provided by the Apache Foundation. In a recent project I used the exec task and needed to allow spaces in a command line argument of the called executable. A legitimate request, but if you don't pay attention to the different ‘exec’ parameter syntaxes, you may waste a lot of time…

Handling spaces in the “exec” task's arguments.

The exec task of Ant allows executing a system command. The arguments of the command are passed as arg sub-tags: the arg tag, followed by an attribute named value, line or path. If your argument contains spaces (for example a file path), do not use the line attribute; use value or path instead. The line attribute treats the spaces as separators between command line arguments for the executed program.

For example, suppose you need to pass a text file as a command line argument to a document-editing application, and the file is located under C:\Documents and settings\msthoughts\doc.txt.
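With the line attribute the call might look like this (editor.exe is a hypothetical executable):

```xml
<exec executable="editor.exe">
  <!-- the line attribute splits its content on spaces -->
  <arg line="C:\Documents and settings\msthoughts\doc.txt"/>
</exec>
```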

Using the line attribute will produce an error, since the command line is interpreted as 3 args, each one denoted here with brackets: [c:\Documents] [and] [settings\msthoughts\doc.txt]

To resolve this in a command line window you would usually wrap the argument containing spaces in quotes, but using quotes from Ant quickly leads to other problems. The better solution is to use the value attribute instead of the line attribute.
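With the value attribute, the same hypothetical call passes the whole path as a single argument:

```xml
<exec executable="editor.exe">
  <!-- value passes the string as one argument, spaces included -->
  <arg value="C:\Documents and settings\msthoughts\doc.txt"/>
</exec>
```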