Following up on the last article, I want to briefly show what the web page looks like if we simply change the business rules while making NO changes to any other part of the process.

In the last article you saw how the ruleengine was integrated into the Hadoop mapreduce process to produce a web page from a CSV file (with data from geonames.org). Data is read from the CSV file, processed by the Hadoop framework, and the result is then processed further to be displayed on a web page using Highcharts.

I now went ahead and changed the business rules. The data file also has information about different types of mines:

I want to display the number of mines per type and country on the web page, so I changed the business rules from the last article according to the codes listed above:

The next step is to export the project with the business rules through the web application and then run the same mapreduce job as last time. Again: we changed only the business rules and nothing else!

And here is the resulting web page from the mapreduce run:

It is as easy as that. You have a flexible, dynamic way of deciding which data is processed and no re-coding of the mapreduce job is required.

Groovy script: to format the mapreduce result and merge it with an Apache Velocity (HTML) template

Highcharts: to show the results in a web browser

The idea is to process data and then finally display it on a web page, as shown below. I do not want to hardcode any logic of which placenames to display on the web page. Instead, I use the Business Rules Maintenance Tool - a web application - to centrally define the business logic of what data to process.

Geonames has a geographical database that covers all countries and contains over eleven million placenames that are available for download free of charge. The data contains information including latitude, longitude, elevation, country, population, timezone, last modification date and more.

On the Hadoop mapreduce side I have created a process that is generic in the sense that it processes the data but does not define which data is processed. The map task includes a reference to the JaRE ruleengine, and the ruleengine determines which data (rows) should be considered and which should not. This way the rule logic is maintained outside the code.

Here is a snippet from the code:
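What follows is a minimal sketch rather than the verbatim code: the JaRE class names (Splitter, RowFieldCollection, BusinessRulesEngine) and their usage follow the library's examples, the file name of the exported rules project is a placeholder, and the exact signatures should be treated as assumptions.

    import java.io.IOException;
    import java.util.zip.ZipFile;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.datamelt.rules.engine.BusinessRulesEngine;
    import com.datamelt.util.RowFieldCollection;
    import com.datamelt.util.Splitter;

    public class GeonamesMapper extends Mapper<LongWritable, Text, Text, NullWritable>
    {
        private BusinessRulesEngine bre;
        private Splitter splitter;

        @Override
        protected void setup(Context context) throws IOException
        {
            try
            {
                // the project exported from the Business Rules Maintenance Tool
                // ("rules_project.zip" is a placeholder file name)
                bre = new BusinessRulesEngine(new ZipFile("rules_project.zip"));

                // splits a row of data into its single fields; pick the
                // separator constant that matches the file's delimiter
                splitter = new Splitter(Splitter.TYPE_COMMA_SEPERATED, Splitter.SEPERATOR_SEMICOLON);
            }
            catch(Exception e)
            {
                throw new IOException("could not initialize the ruleengine", e);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            try
            {
                // collect all fields and their values from the current row
                RowFieldCollection collection = splitter.getRowFieldCollection(value.toString());

                // run the business rules against the collection
                bre.run("row " + key.get(), collection);

                // only rows that pass all rulegroups are handed on
                if(bre.getNumberOfGroupsFailed()==0)
                {
                    context.write(value, NullWritable.get());
                }
            }
            catch(Exception e)
            {
                throw new IOException("error processing row", e);
            }
        }
    }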

The splitter simply splits the incoming row of data into single fields. These are put into a collection that contains all the fields including the field values. Finally, the ruleengine is run against the collection and evaluates whether the data passes or fails the defined business logic.

The other part of the puzzle is the business rules themselves. I have defined a project named "Mapreduce Countries" in the Business Rules Maintenance Tool - a web application that is freely available.

And inside this project I have defined the rules as shown below. The project has a rulegroup (which groups rule logic). Inside the rulegroup there is one subgroup (you can have multiple ones for more complex logic). And inside the subgroup I have four rules. The rules inside the subgroup are connected using an "or" condition. The logic checks if the feature code in the data (row) is GRVC or GRVO or GRVP or GRVPN. If that is the case the rulegroup passes.

If you go back to the code displayed above you find a line: if(bre.getNumberOfGroupsFailed()==0). This checks if the rulegroup passes (no failed rulegroups), and if it does, the data continues to be processed in the Hadoop mapreduce task. If the rulegroup fails, the row is dropped and not processed in mapreduce.

So one can say that in this case the ruleengine and the rules are used as a simple filter mechanism to decide whether or not to process the data.

Once the business logic is defined, you can export the complete project to a single file which can then be used by the ruleengine which is hooked into the mapreduce job.

The next step is to run the mapreduce process. It processes the data from the geonames data file (CSV) and creates the result, which is then copied from HDFS to the local filesystem. At this point a Groovy script runs which formats the data so that Highcharts can display it. Highcharts makes it easy for developers to set up interactive charts in their web pages. To produce the chart I use an Apache Velocity template: a blueprint of an HTML page, but without the data (placeholders instead). Groovy processes/formats the data and then merges it with the Velocity template. The result is a web page as shown at the top of this article.
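The merge step itself is only a few calls to the Velocity API. A minimal sketch (in Java rather than Groovy, but the calls are the same; the template and variable names are made up for illustration):

    import java.io.FileWriter;
    import java.io.Writer;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.velocity.Template;
    import org.apache.velocity.VelocityContext;
    import org.apache.velocity.app.VelocityEngine;

    public class ChartPageBuilder
    {
        public static void main(String[] args) throws Exception
        {
            VelocityEngine engine = new VelocityEngine();
            engine.init();

            // the blueprint of the HTML page, containing placeholders
            Template template = engine.getTemplate("chart_template.vm");

            // the formatted mapreduce output, e.g. one entry per country
            List<String> seriesData = new ArrayList<>();
            seriesData.add("['Germany', 42]");

            // fill the placeholders and write the finished web page
            VelocityContext context = new VelocityContext();
            context.put("seriesData", seriesData);
            try (Writer writer = new FileWriter("chart.html"))
            {
                template.merge(context, writer);
            }
        }
    }

Inside the template, a placeholder like $seriesData is then expanded into the Highcharts series definition.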

That's it! Now I can change the business logic to display different data and I do not need to touch my mapreduce code. You can create very complex business rule logic: for example, find data rows with the placenames mentioned above, but only for selected countries, within a specific geofence or depending on the timezone. The possibilities are unlimited. The web tool allows you to create arbitrarily complex logic by combining rules and subgroups using "and" and "or" conditions.

In the rule logic you can evaluate the data using checks such as: "is equal", "is greater/smaller", regular expressions, mathematical calculations, the soundex algorithm, checks whether the data is not null, not empty, is in a list or is between certain values, and much, much more.

And - very important - when running the data through the ruleengine, you can also update the data. You can apply actions which do certain calculations or modifications to the data, such as: mathematical calculations (plus, minus, multiply, divide, sin, cos, tan, modulo, square root, etc.), set values, sum field values, upper-/lowercase, percentage, substring, append/prepend values and more.

The ruleengine is extendable: you can create additional checks that are used to evaluate the data and additional actions that modify the data according to your needs.
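For illustration only (the real extension points are described in the ruleengine documentation), a custom check could look roughly like this. The built-in JaRE checks expose static evaluate() methods; this follows that pattern, but the class layout and registration details are assumptions here.

    // Hypothetical custom check: does the field value look like a geonames
    // feature code? Base class and registration would need to follow the
    // JaRE documentation; this only shows the evaluate() pattern.
    public class CheckIsFeatureCode
    {
        public static boolean evaluate(String value)
        {
            return value != null && value.matches("[A-Z]{1,5}");
        }
    }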

If you think about this setup, you will see that it separates the IT code from the business logic. They are not mixed, which otherwise is often the case. Because they are separated, IT experts can manage the mapreduce job while business experts maintain the business rules logic. This is a clear separation of responsibilities and makes the IT code cleaner. A major benefit for agility and quality!

The ruleengine and the web application are open source, so go ahead and integrate them into your Java projects, mapreduce tasks or web applications. Everything is available on GitHub, including documentation, a presentation and examples.

So I have my Raspberry Pi Hadoop cluster running, as described in the previous post. The next step was to dynamically filter data in the mapreduce job using my ruleengine.

As you can read in my previous posts, it is never a good idea to mix your IT code with the business rules. That's what the ruleengine is good for: define and handle the business rules outside of your IT code. If the rules change, you change them in a central web tool and you don't have to touch your IT code. In your code you simply reference the ruleengine and the file that contains all business rules. That makes the code clearer, establishes a proper division of responsibilities (rules managed by the business, IT code managed by IT), makes IT code changes more agile and, in the end, enhances the overall quality.

As the ruleengine is written in Java, it is a perfect match for Apache Hadoop. You may also update/manipulate the data using the ruleengine, but for now the goal is simply to filter data. As rows of data are processed by the mapreduce job, the ruleengine runs against the data and filters out those rows that are not applicable (according to the business rules).

For example, you have a large file with data for different customers and you want to mapreduce the data for only a few of them. Or - another example - you have data from various websites and you want to filter out data that does not fulfill certain requirements, such as the type of web browser, the URL or the origin of the web page.

Instead of (hard-)coding these rules in your mapreduce job you would tackle the task as follows:

Use the Business Rules Web tool to define the business rules. This can be arbitrarily complex logic. The tool helps you to define single rules and combine them into groups of rules.

Export the business rules (project) from the web tool.

In your mapreduce job, reference the business rules engine and the exported file from the previous step by adding a few lines of code (one way to ship the exported file with the job is sketched after this list).

Depending on the result of the ruleengine for the current data (passed or failed), add the row to the context of the map part of the mapreduce job.
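For distributing the exported file to the worker nodes, Hadoop's distributed cache is one option. A sketch of the driver side (paths are placeholders, and RuleengineMapper stands for a map class wired up as described in steps 3 and 4):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RuleengineFilterDriver
    {
        public static void main(String[] args) throws Exception
        {
            Job job = Job.getInstance(new Configuration(), "ruleengine filter");
            job.setJarByClass(RuleengineFilterDriver.class);
            job.setMapperClass(RuleengineMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // the project file exported from the web tool, previously
            // uploaded to HDFS (hypothetical path)
            job.addCacheFile(new URI("hdfs:///rules/project_export.zip"));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In the mapper's setup() the file can then be opened from the task's local working directory.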

You won't have any "changing variables" (the business logic) in your code now. Run your mapreduce job and make sure that it works according to the requirements.

If you now need to change which data is filtered (the business comes with new requirements...), simply go back to the web tool and change the business rules (ideally have the business construct the rule logic). Then export them again and re-run the mapreduce job.

The creation of the rules file could also be automated, so that a new file is created and distributed at regular intervals or based on a certain trigger/condition.

All parts - the web tool and the rule engine - are open source, so you can freely use them in your Java based projects. Go ahead and give it a try. Using the ruleengine will make your IT life easier and your code clearer, and the business user gets a central location to manage and review the rules. In the web tool the business user is NOT confronted with your IT code, and it is much easier for her/him to understand what the rules do, because she/he is not distracted or discouraged by a mixture of IT code and business rules. This is much more transparent to the user.

Recently I have spent some time on Apache Hadoop. I had a basic idea already of what it does, but I wanted to learn and understand more about how it works and then of course I wanted to try it out.

So I started reading documentation to understand the basic concepts. There is a lot of documentation out there, and so I also spent considerable time finding the "right" documentation - "right" in the sense of "recent", because a lot of the available information does not apply one-to-one to recent versions.

I then downloaded Hadoop, installed it on my laptop and ran the famous wordcount mapreduce job. That is straightforward and worked quickly. And so I did further tests with more data and with larger CSV files.

As I am also a Java developer (besides other things), I created a project in Eclipse and started coding my first mapreduce job to get a feeling for how easy or hard that is. The map and reduce idea is easy enough, but there are a lot of details to watch out for. But finally I had my own mapreduce job that reads from a CSV file and outputs the results for the defined key and the sum of a certain column in the file.

My next step was to modify the mapreduce code in a way that lets me dynamically select the columns from the CSV file that make up the key, as well as the column that I want to use for summing. I did this by loading this information from a properties file external to the job. There is still a lot to learn, but at this point I had a working version of my own mapreduce job.
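A minimal sketch of that idea (the property names are made up; the driver would read them from the external properties file and put them into the job Configuration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper for a CSV sum job where the key columns and the column to sum
    // are not hardcoded but come from the job configuration. Assumes clean
    // numeric data in the sum column.
    public class CsvSumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable>
    {
        private int[] keyColumns;
        private int sumColumn;

        @Override
        protected void setup(Context context)
        {
            Configuration conf = context.getConfiguration();
            String[] parts = conf.get("csvjob.key.columns", "0").split(",");
            keyColumns = new int[parts.length];
            for (int i = 0; i < parts.length; i++)
            {
                keyColumns[i] = Integer.parseInt(parts[i].trim());
            }
            sumColumn = conf.getInt("csvjob.sum.column", 1);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String[] fields = value.toString().split(",");

            // build a composite key from the configured columns
            StringBuilder outKey = new StringBuilder();
            for (int c : keyColumns)
            {
                if (outKey.length() > 0)
                {
                    outKey.append("|");
                }
                outKey.append(fields[c]);
            }
            context.write(new Text(outKey.toString()),
                    new DoubleWritable(Double.parseDouble(fields[sumColumn])));
        }
    }

A matching reducer then simply sums the DoubleWritable values per key.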

Hadoop is all about parallel processing, so the next step was to set up a cluster. I do not have multiple machines at hand, but I have a few Raspberry Pi mini computers lying around, so I decided to give it a go. A portable Hadoop cluster seemed like a nice idea.

Here is a picture of the final result:

The top PI (a PI 3) is my namenode, the other three (PI 2) are my datanodes. So the top one is the master that organizes and coordinates things, and the other ones do the mapreduce work.

Again, there is lots of information on the net, but also a lot of outdated information. Initially I had three PIs, where the first one was the master but at the same time also a datanode. I spent many days finding a working configuration. I arrived at a point where the cluster was active and working, but only the master was doing the mapreduce work. After a lot of research I found out that the other nodes were not communicating with the ResourceManager daemon. I updated the configuration and finally had a cluster where all PIs work together.
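For reference, the setting that usually governs this is the ResourceManager hostname in yarn-site.xml, which has to point to the master on every node (I am not reproducing my exact configuration here; "node1" is a placeholder for the master PI):

    <!-- yarn-site.xml on all nodes: tell the NodeManagers where the
         ResourceManager runs -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node1</value>
    </property>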

And then finally I wanted to see how easy it is to add one more datanode to the cluster. I added one more PI and changed the configuration of the first one so that it is no longer a datanode.

What you see above is now working: a cluster of 4 computers based on Hadoop 2.7.3. I uploaded two CSV files into HDFS, each containing 50000 lines, so there are 100000 lines to be processed. The mapreduce job calculates the sum of one field and outputs the results for a key that is made up of three fields.

And below is a screenshot of the cluster working on the mapreduce job. On the right are 4 monitors (using nmon) for the nodes, and on the left is the output from the mapreduce job.

It took a while to get it all running and to gather all the information, but I have learned a lot about Hadoop and its ecosystem along the way. I will post more in the near future.

If anybody is interested in the configuration details, let me know - I am happy to share it.