checking consistency of all components of python development environment... no
configure: error: in `/home/jagat/development/code/berkley/mesos':
configure: error:
Could not link test program to Python. Maybe the main Python library has been
installed in some non-standard library path. If so, pass it to configure,
via the LDFLAGS environment variable.
Example: ./configure LDFLAGS="-L/usr/non-standard-path/python/lib"
============================================================================
ERROR!
You probably have to install the development version of the Python package
for your distribution. The exact name of this package varies among them.
============================================================================

See `config.log' for more details
jagat@nanak-P570WM:~/development/code/berkley/mesos$ apt-cache search python27

Solution
Install python-dev

sudo apt-get install python-dev

checking for curl_global_init in -lcurl... no
configure: error: cannot find libcurl
-------------------------------------------------------------------
You can avoid this with --without-curl, but it will mean executor
and task resources cannot be downloaded over http.
-------------------------------------------------------------------

Solution
sudo apt-get install libcurl4-openssl-dev

checking whether -pthread is sufficient with -shared... yes
checking for backtrace in -lunwind... no
configure: error: failed to determine linker flags for using Java (bad JAVA_HOME or missing support for your architecture?)

Solution
Download the JDK, then set the environment variables as follows:

export JAVA_HOME=/home/jagat/development/tools/jdk1.6.0_45

export PATH=$PATH:$JAVA_HOME/bin

----

And in the end it fails with this message:

cc1plus: all warnings being treated as errors
make[2]: *** [sched/libmesos_no_3rdparty_

Now, if some processing has to be done incrementally by changing a variable across the whole dataset, MapReduce will again start by reading from disk. If you run the processing 100 times, it will do 100 reads, times 2 once you count the intermediate stages as well.

Spark addresses the typical use case where the same processing is applied to a dataset with varying variable inputs.

Two typical usecases where Spark shines are:

Iterative jobs
Many common machine learning algorithms apply a function repeatedly to the same dataset
to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a
MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.

Interactive analysis
Hadoop is often used to perform ad-hoc exploratory queries on big datasets, through
SQL interfaces such as Pig and Hive. Ideally, a user would be able to load a dataset of interest into
memory across a number of machines and query it repeatedly.
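The iterative-job penalty can be illustrated with a toy Python sketch. This is not Spark code; `load_dataset` and the gradient step are hypothetical stand-ins, purely to count how often the data is read from storage:

```python
# Toy illustration (not the Spark API): why reloading data each iteration hurts.

def load_dataset():
    """Simulates an expensive read from disk; counts its own calls."""
    load_dataset.calls += 1
    return [1.0, 2.0, 3.0, 4.0]

load_dataset.calls = 0

# MapReduce-style: every iteration reloads the dataset from storage.
w = 0.0
for _ in range(100):
    data = load_dataset()                       # one disk read per iteration
    grad = sum(x - w for x in data) / len(data)
    w += 0.1 * grad
reloads = load_dataset.calls                    # 100 reads

# Spark-style: load once, keep the dataset in memory across iterations.
load_dataset.calls = 0
cached = load_dataset()                         # a single disk read
w = 0.0
for _ in range(100):
    grad = sum(x - w for x in cached) / len(cached)
    w += 0.1 * grad
cached_reads = load_dataset.calls               # 1 read

print(reloads, cached_reads)  # → 100 1
```

The second loop is what an in-memory RDD buys you: the per-iteration cost drops to pure computation.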

To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.
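The lineage idea can be sketched in a few lines of plain Python. The `RDD` class below is a toy of my own, not Spark's implementation; it just records the parent dataset and the transform so a lost partition can be recomputed:

```python
# Minimal sketch of lineage-based recovery (class and method names are
# illustrative, not Spark's).
class RDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions   # list of lists, one per machine
        self.parent = parent           # the RDD this one was derived from
        self.transform = transform     # the function applied per element

    def map(self, fn):
        out = [[fn(x) for x in p] for p in self.partitions]
        return RDD(out, parent=self, transform=fn)

    def rebuild(self, i):
        """Recompute partition i from the parent's data and the transform."""
        self.partitions[i] = [self.transform(x) for x in self.parent.partitions[i]]

base = RDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
squared.partitions[1] = None   # simulate losing one partition
squared.rebuild(1)             # lineage is enough to restore just that partition
print(squared.partitions)      # → [[1, 4], [9, 16]]
```

Note that only the lost partition is recomputed; the surviving partitions are untouched.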

Spark provides two main abstractions for parallel programming:

resilient distributed datasets and
parallel operations on these datasets (invoked by passing a function to apply on a dataset).
These are based on the typical functional programming concepts of map, flatMap, filter, etc.
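The same functional primitives exist in plain Python, which is a quick way to build intuition for them (the `flat_map` helper below is my own; Spark spells it `flatMap`):

```python
nums = [1, 2, 3, 4]

# map: apply a function to every element
mapped = list(map(lambda x: x * 2, nums))            # → [2, 4, 6, 8]

# filter: keep only the elements matching a predicate
filtered = list(filter(lambda x: x % 2 == 0, nums))  # → [2, 4]

# flatMap: map each element to a list, then flatten the results
def flat_map(fn, xs):
    return [y for x in xs for y in fn(x)]

flattened = flat_map(lambda x: [x, x], [1, 2])       # → [1, 1, 2, 2]
```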

In addition, Spark supports two restricted types of shared variables that can be used in functions running
on the cluster

Variables in Spark

Broadcast variables: Read Only variable

Accumulators: These are variables that workers can only “add” to using an associative operation
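A toy sketch of these two shared-variable types in plain Python (the class names are illustrative, not Spark's API). The key constraints are that a broadcast value is read-only for workers, and that an accumulator only accepts additions through an associative operation, so the order in which workers' contributions are merged does not matter:

```python
class Broadcast:
    """Read-only: workers can read .value but never reassign it."""
    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value

class Accumulator:
    """Workers may only add; an associative op makes merge order irrelevant."""
    def __init__(self, zero, op):
        self.value = zero
        self.op = op   # must be associative, e.g. addition

    def add(self, x):
        self.value = self.op(self.value, x)

lookup = Broadcast({"a": 1, "b": 2})          # shipped once to every worker
acc = Accumulator(0, lambda a, b: a + b)
for key in ["a", "b", "a"]:                   # stand-in for work on the cluster
    acc.add(lookup.value[key])                # read the broadcast, add to the accumulator
print(acc.value)  # → 4
```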

Credits
This post is nothing but a reproduction of work done at AmpLab. If you want the latest and most detailed read, I suggest you go there.
The big data world is beautiful; research in this field is moving at such a fast pace that BigData is no longer synonymous with long-running queries, it is becoming LiveData everywhere.
The work compares the computation times of Redshift, Hive, Impala, and Shark across different types of queries.

The performance of Shark in memory has been consistent across all 4 query types. It would be interesting to see the comparison when Hive 0.11 is used, as it also adds a few performance improvements driven by work at Hortonworks.

Monolithic schedulers use a single, centralized scheduling algorithm for all jobs (our existing
scheduler is one of these).

Two-level schedulers have a single active resource manager that offers compute resources to multiple parallel, independent “scheduler frameworks”, as in Mesos and Hadoop-on-Demand (HPC).

The paper classifies YARN as a monolithic scheduler and Mesos as a two-level scheduler.

It is an interesting read and also raises one question about YARN.

I quote

It might appear that YARN is a two-level scheduler, too. In YARN, resource requests from per-job application masters are sent to a single global scheduler in the resource master, which allocates resources on various machines, subject to application-specified constraints. But the application masters provide job-management services, not scheduling, so YARN is effectively a monolithic scheduler architecture.
At the time of writing, YARN only supports one resource type (fixed-sized memory chunks). Our experience suggests that it will eventually need a rich API to the resource master in order to cater for diverse application requirements, including multiple resource dimensions, constraints, and placement choices for failure-tolerance.

Although YARN application masters can request resources on particular machines, it is unclear how they acquire and maintain the state needed to make such placement decisions.

Google seems to be drifting away from YARN, unlike its counterpart Yahoo.

Architecturally, how does YARN compare with Mesos?
Conceptually YARN and Mesos address similar requirements. They enable
organizations to pool and share horizontal compute resources across a multitude
of workloads. YARN was architected specifically as an evolution of Hadoop 1.x.
YARN thus tightly integrates with HDFS, MapReduce and Hadoop security.