Functional Programming
According to Wikipedia, in computer science functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Since there is no hidden dependency (via shared state), functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing the parallelism is much easier when there is no hidden dependency from shared state. Map/reduce is a special form of such a directed acyclic graph which is applicable in a wide range of use cases. It is organized as a “map” function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and routed to the same node, where a “reduce” function is used to merge the values (of the same key) into a single result.

Map Reduce

A way to take a big task and divide it into discrete tasks that can be done in parallel. Map/Reduce is just a pair of functions operating over a list of data.
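As a rough illustration (not Hadoop itself, and input.txt is just a hypothetical file), the classic word count can be sketched with ordinary shell tools: a map step that emits one word per line, a sort step that groups identical keys together, and a reduce step that merges each group into a count:

$ tr -s ' ' '\n' < input.txt | sort | uniq -c

Here tr plays the “map” role (one word per line), sort plays the shuffle/sort role, and uniq -c plays the “reduce” role, merging the values that share a key into a single count.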

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

The framework is inspired by map and reduce functions commonly used in functional programming,[3] although their purpose in the MapReduce framework is not the same as their original forms.

Hadoop
A large-scale batch data processing system.

It uses MapReduce for computation and HDFS for storage.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

It is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Hadoop is an open source Java implementation of Google’s MapReduce algorithm along with an infrastructure to support distributing it over multiple machines. This includes its own filesystem (HDFS, the Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files. When thinking about Hadoop it’s important to keep in mind that the infrastructure is a huge part of it. Implementing MapReduce is simple. Implementing a system that can intelligently manage the distribution of processing and of your files, and break those files down into more manageable chunks for processing in an efficient way, is not.

HDFS breaks files down into blocks which can be replicated across its network (how many times each block is replicated is determined by your application and can be specified on a per-file basis). This is one of the most important performance features and, according to the docs, “…is a feature that needs a lot of tuning and experience.” You really don’t want 50 machines all trying to pull from a 1TB file on a single data node at the same time, but you also don’t want to replicate a 1TB file out to 50 machines. So, it’s a balancing act.
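For example, the replication factor of a file that is already in HDFS can be changed with the setrep command (the path here is just a placeholder; -w waits until the new replication level is reached):

$ hadoop fs -setrep -w 3 /user/hduser/data.txt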

• The JobTracker manages the MapReduce work. It “…is the central location for submitting and tracking MR jobs in a network environment.”

• The TaskTracker and DataNode do the grunt work.

Hadoop – NameNode, DataNode, JobTracker, TaskTracker

The JobTracker will first determine the number of splits (each split is configurable, ~16-64MB) from the input path and select some TaskTrackers based on their network proximity to the data sources; then the JobTracker sends the task requests to those selected TaskTrackers.

Each TaskTracker will start the map phase processing by extracting the input data from its splits. For each record parsed by the “InputFormat”, it invokes the user-provided “map” function, which emits a number of key/value pairs into an in-memory buffer. A periodic wakeup process will sort the memory buffer into the different reducer partitions, invoking the “combine” function along the way. The key/value pairs are sorted into one of the R local files (supposing there are R reducer nodes).

When a map task completes (all its splits are done), the TaskTracker will notify the JobTracker. When all the TaskTrackers are done, the JobTracker will notify the selected TaskTrackers to begin the reduce phase.

Each reduce-side TaskTracker will read the region files remotely. It sorts the key/value pairs and, for each key, invokes the “reduce” function, which collects the key/aggregatedValue into the output file (one per reducer node).

The Map/Reduce framework is resilient to the crash of any component. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When a map-phase TaskTracker crashes, the JobTracker will reassign the map task to a different TaskTracker node, which will rerun all the assigned splits. If a reduce-phase TaskTracker crashes, the JobTracker will rerun the reduce at a different TaskTracker.

Let’s Try Hands-On Hadoop
The objective of this tutorial is to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux, with the use of VMware Player.

<HADOOP_INSTALL>/conf/masters

The conf/masters file defines the namenodes of our multi-node cluster. In our case, this is just the master machine.

192.168.1.3

<HADOOP_INSTALL>/conf/slaves

This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

192.168.1.3

192.168.28.136

NOTE: Here 192.168.1.3 and 192.168.28.136 are the IP addresses of the physical machine and the virtual machine respectively, which may vary in your case. Just enter the IP addresses in the files and you are done!

Let’s enjoy the ride with Hadoop:

All set for “HANDS ON HADOOP”.

Formatting the name node

On the Physical Machine and the Virtual Machine

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.

hadoop namenode -format

Starting the multi-node cluster

1. Start HDFS daemons

Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the (primary) namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the command on, and datanodes on the machines listed in the conf/slaves file.
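As a quick sanity check (assuming a Sun/Oracle JDK, which ships the jps tool), you can list the running Java daemons on each box:

$ jps

On the master you should see NameNode, SecondaryNameNode and DataNode (plus Jps itself); on the slave VM, only DataNode.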

Physical Machine

Hadoop – start-dfs.sh

VM

Hadoop – DataNode on Slave Machine

2. Start MapReduce daemons

Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the command on, and tasktrackers on the machines listed in the conf/slaves file.
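Running jps again should now additionally show JobTracker and TaskTracker on the master, and TaskTracker on the slave VM:

$ jps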

Physical Machine

Hadoop – Start MapReduce daemons

VM

TaskTracker in Hadoop

Running a MapReduce job

Here’s the example input data I have used for the multi-node cluster setup described in this tutorial (the sample commands for running the job over it come right after the data below).

The multitenant architecture of Force.com consists of the following features:

• Shared infrastructure. Every customer (or tenant) of Force.com shares the same infrastructure. You are assigned a logical environment within the Force.com infrastructure.

• Single version. There is only one version of the Force.com platform in production. The same platform is used to deliver applications of all sizes and shapes, used by 1 to 100,000 users.

• Continuous, zero-cost improvements. When Force.com is upgraded to include new features or bug fixes, the upgrade is enabled in every customer’s logical environment with zero to minimal effort required.

• Infrastructure exposure. Force.com is targeted toward corporate application developers and independent software vendors. Unlike the other PaaS offerings, it does not expose developers directly to its own infrastructure.

• Integration with other technologies. Force.com integrates with other technologies using open standards such as SOAP and REST, although the programming languages and metadata representations used to build applications are proprietary to Force.com.

• Relational database

– To store and manage the business data. Data is stored in objects.
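With this text saved to a local file, a typical run looks like the following (the file name, the HDFS paths and the examples jar name are illustrative and vary with your Hadoop version and user):

$ bin/hadoop dfs -copyFromLocal /tmp/force.txt /user/hduser/input

$ bin/hadoop jar hadoop-*-examples.jar wordcount /user/hduser/input /user/hduser/output

$ bin/hadoop dfs -cat /user/hduser/output/part-00000

The first command copies the input into HDFS, the second runs the stock wordcount example job across the cluster, and the third prints the reducer output.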

Note:
The 30-day free trial doesn’t provide workflow support; otherwise we could create a full-featured application. In the trial, we can create a Visualforce page, but we cannot enable Sites for our organization, nor register our Force.com domain name and expose the Visualforce page we created as a public product catalog on the Web.

Workflow support is available in Force.com One App: start with one custom app, for your organization only.

Open Source ILS: Installation Guide for Koha on Ubuntu 11.10 or Ubuntu 11.04 or Ubuntu 10.10 or Ubuntu 10.04 LTS with MySQL 5

According to Wikipedia, an integrated library system (ILS), also known as a library management system (LMS), is an enterprise resource planning system for a library, used to track items owned, orders made, bills paid, and patrons who have borrowed.

An ILS usually comprises a relational database, software to interact with that database, and two graphical user interfaces (one for patrons, one for staff). Most ILSes separate software functions into discrete programs called modules, each of them integrated with a unified interface. Examples of modules include acquisitions, circulation, cataloging, serials, and the OPAC.

Koha is a free software library automation package. In use worldwide, its development is steered by a growing community of users collaborating to achieve their technology goals.

Features

Full-featured ILS. In use worldwide in libraries of all sizes, Koha is a true enterprise-class ILS with comprehensive functionality including basic or advanced options. Koha includes modules for circulation, cataloging, acquisitions, serials, reserves, patron management, branch relationships, and more.

Full-text searching. Koha uses an RDBMS coupled with an external search engine to provide powerful searching that is truly scalable.

Library Standards Compliant. Koha is built using library standards and protocols that ensure interoperability between Koha and other systems and technologies, while supporting existing workflows and tools.

Web-based Interfaces. Koha’s OPAC, circulation, management and self-checkout interfaces are all based on standards-compliant World Wide Web technologies (XHTML, CSS and JavaScript), making Koha a truly platform-independent solution.

No Vendor Lock-in. It is an important part of the free software promise that there is no vendor lock-in: libraries are free to install and use Koha themselves if they have the in-house expertise, or to purchase support or development services from the best available sources. Libraries should be free to change support company and export their data at any time; make sure your support company allows this.

Koha is free software and is licensed under the GNU General Public License, either version 2 of the License, or (at your option) any later version.
All commands can be performed as a system user with sudo privileges, as indicated, or by running the command directly as root.

Note: you may need to run CPAN initialization if you’ve not run cpan before; a successful first run ends with a message like “/etc/perl/CPAN/Config.pm initialized.”

CPAN is the world-wide archive of perl resources. It consists of about 100 sites that all replicate the same contents all around the globe. Many countries have at least one CPAN site already. The resources found on CPAN are easily accessible with the CPAN.pm module. If you want to use CPAN.pm, you have to configure it properly. If you do not want to enter a dialog now, you can answer ‘no’ to this question and I’ll try to autoconfigure. (Note: you can revisit this dialog anytime later by typing ‘o conf init’ at the cpan prompt.)

Are you ready for manual configuration? [yes]
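If you answer no here, you can revisit the dialog later, as the prompt itself notes, by typing o conf init at the cpan prompt:

$ sudo cpan

cpan> o conf init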

When the configuration is completed CPAN will install the Perl modules.

Koha – Install Perl dependencies


2. Configuration of dependencies

2.1 Update root MySQL password (if dselect didn’t do it for you already)
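One straightforward way to set it is with mysqladmin (the password shown is a placeholder):

$ sudo mysqladmin -u root password 'your-new-password'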

You must be sure you’re using the XML::LibXML SAX parser, not Expat or PurePerl, both of which have outstanding bugs with pre-composed characters. You can test your SAX parser by running:

$ cd koha

$ misc/sax_parser_print.pl

You should see something like:

XML::LibXML::SAX::Parser=HASH(0x81fe220)

If you’re using PurePerl or Expat, you’ll need to edit your ini file, typically located at:

/etc/perl/XML/SAX/ParserDetails.ini

You will need to move the entire section for ‘[XML::LibXML::SAX::Parser]’ to the bottom of the ini file.
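For reference, a correctly ordered ParserDetails.ini would end with the LibXML section, in the usual section/feature format (the feature line shown is illustrative):

[XML::SAX::PurePerl]
http://xml.org/sax/features/namespaces = 1

[XML::LibXML::SAX::Parser]
http://xml.org/sax/features/namespaces = 1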

Koha – Test your SAX Parser


2.4 Install DBD::mysql Perl module

In order to handle UTF-8 correctly, Koha requires at least version 4.004 of the DBD::mysql Perl module. However, Debian Etch has a stable package only for version 3.0008, so it is necessary to install the module from CPAN. DBD::mysql’s test suite needs to use a MySQL ‘test’ DB which doesn’t exist anymore. So there are two options to install DBD::mysql:

2.4.1 Install without test suite

Force install DBD::mysql:

$ sudo cpan

cpan> force install DBD::mysql

Koha – Install DBD::mysql Perl module


2.4.2 Create test database in order to install DBD::mysql

Because of DBD::mysql’s test suite, it is necessary to temporarily create a test database and user:
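A minimal sketch of that temporary setup, assuming a MySQL root login and a throwaway 'test' user (drop both again once the module is installed):

$ mysql -u root -p

mysql> create database test;

mysql> grant all privileges on test.* to 'test'@'localhost' identified by 'test';

mysql> flush privileges;

mysql> quit

$ sudo cpan

cpan> install DBD::mysql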

Note: you may still see the usual Apache default site if your VirtualHost configuration isn’t correct. The command “sudo a2dissite default” may be a quick fix, but may have side-effects. See the Apache HTTPD manual section on virtual hosts for full instructions.

5. Configure and start Zebra

Note: it’s recommended that you daemonize the Zebra process and add it to your startup profile. For a non-production test/development installation, running Zebra from the command line can be useful. Pick from the two available options below, or roll your own 🙂

Note: it’s also recommended that you create a Koha system user, which you will have specified during the install process. Alternatively, Zebra can be configured to run as the root user.

To add a user do:

$ sudo adduser koha

Option 1: run the Zebra processes from the command line:

1.1 Zebra Search Server

This process sends responses to search requests sent by Koha or Z39.50/SRU/SRW clients.

$ sudo -u ${KOHA_USER} zebrasrv -f /etc/koha/koha-conf.xml

(note that the path to koha-conf.xml may be different depending on your installation choices)

Note: the user you run Zebra as will be the only user with write permission on the Zebra index; in development mode, you may wish to use your system user.

1.2 Zebra Indexer

Records added, updated, or deleted in the Koha MySQL database must be indexed into Zebra. A specific script must be launched each time a bibliographic or an authority record is edited.

$ sudo -u ${KOHA_USER} misc/migration_tools/rebuild_zebra.pl -z -b -a

NOTE: This script should be run as the kohauser (the default is ‘koha’).
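Rather than launching this by hand after every edit, a common approach is to run it periodically from cron; a sketch for /etc/crontab (the schedule and the installed script path are illustrative, and KOHA_CONF must point at your koha-conf.xml):

*/10 * * * * koha KOHA_CONF=/etc/koha/koha-conf.xml /usr/share/koha/bin/migration_tools/rebuild_zebra.pl -b -a -z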

Option 2: run the Zebra process as a daemon, and add to startup process:

Note that references to $SCRIPT_DIR refer to the directory where Koha’s command-line scripts are installed, e.g., /usr/share/koha/bin.